bump universal image to support training hub 0.3.0 #502
Conversation
Walkthrough

Updated training-image Dockerfiles and Tekton pipelines: the runtime Dockerfile revises package versions and adds a deterministic two-step wheel build and universal entrypoint wiring; a new debug Dockerfile adds CUDA/toolchain, RDMA, diagnostics, and wheel-build steps; Tekton PipelineRuns gain 9h timeouts; the README gets a minor note.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Builder as Docker build
    participant Preflight as Preflight checks
    participant Fetch as Binary wheel fetch
    participant Build as Two-step wheel build
    participant Pip as pip install
    participant Image as Final image
    rect rgb(240, 248, 255)
        Builder->>Preflight: run diagnostics (nvcc, indexes, TORCH_CUDA_ARCH_LIST)
    end
    rect rgb(245, 255, 230)
        Preflight->>Fetch: attempt binary wheel install
        Fetch-->>Preflight: success / not available
    end
    alt binary available
        Fetch->>Pip: install binary wheels
    else build required
        Preflight->>Build: deterministic two-step build (mamba-ssm, permissions, heartbeat, timeout)
        Build->>Pip: install locally-built wheels
    end
    Pip->>Image: finalize (set ENV, COPY entrypoint, set ENTRYPOINT/CMD)
    Image-->>Builder: image produced
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
images/universal/training/py312-cuda128-torch280/Dockerfile (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Red Hat Konflux / universal-image-py312-cuda128-torch280-on-pull-request
🔇 Additional comments (4)
images/universal/training/py312-cuda128-torch280/Dockerfile (4)
163-168: No issues found; entrypoint script is properly implemented. The `entrypoint-universal.sh` script exists and correctly implements the required conditional routing logic. It properly routes between workbench mode (when NOTEBOOK_ARGS is set, falling back to `start-notebook.sh`) and headless runtime mode (execs the provided command), with appropriate error handling when neither is available.
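For readers unfamiliar with this dual-mode pattern, the routing logic described above can be sketched as a POSIX shell function. This is a hypothetical simplification for illustration, not the actual `entrypoint-universal.sh`; only the names `NOTEBOOK_ARGS` and `start-notebook.sh` come from the review.

```shell
# Hypothetical simplification of the dual-mode routing (not the real script):
# - workbench mode when NOTEBOOK_ARGS is set or no command was given
# - headless mode when a command is provided (the real script would exec it)
route_mode() {
  if [ -n "${NOTEBOOK_ARGS:-}" ] || [ "$#" -eq 0 ]; then
    echo "workbench"   # real script: exec start-notebook.sh
  else
    echo "headless"    # real script: exec "$@"
  fi
}

route_mode                    # no command -> workbench
route_mode python train.py   # command given -> headless
```

Because the wrapper only branches on its arguments and one environment variable, the same image can serve as an interactive workbench by default and as a headless training runtime when a command is supplied.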
122-122: transformers 4.57.1 is compatible with PyTorch 2.8.0. The library requires torch >= 2.2, so it works with PyTorch 2.8.0. No compatibility issues to address.
139-145: Verify training-hub 0.3.0 is accessible and its dependencies are documented. The training_hub project is available from the Red Hat AI Innovation Team, but web search shows training-hub 0.2.0 as the latest version published on PyPI; training-hub 0.3.0 does not appear to be publicly released. The packages being added (einops 0.8.1, kernels 0.10.3, instructlab-training 0.12.1, rhai-innovation-mini-trainer 0.3.0) should be verified as explicit dependencies of training-hub 0.3.0 rather than assumptions.
Confirm:
- Whether training-hub 0.3.0 is pulled from a development branch or internal repository
- That these packages (especially instructlab-training 0.12.1, which is significantly older than current versions) are intentional and compatible with the training-hub 0.3.0 target
157-160: Remove `--no-deps` or switch to the documented installation method for mamba-ssm. The 2-step installation pattern (causal-conv1d first, then mamba-ssm) aligns with official documentation. However, using `--no-deps` on mamba-ssm deviates from the documented procedure, which instead recommends:
- `pip install mamba-ssm==2.2.6.post3` (without `--no-deps`), or
- `pip install "mamba-ssm[causal-conv1d]==2.2.6.post3"` (using extras syntax).

The `--no-deps` flag is not mentioned in the official installation instructions, and it assumes all transitive dependencies are already installed in the earlier pip block. This assumption is not explicit and could silently omit critical packages. Recommendation: either remove `--no-deps` to let pip resolve dependencies automatically, or test the container build to confirm all runtime dependencies are satisfied.
Force-pushed from a5aa67d to 3416b2c (compare)
```dockerfile
# Install remaining runtime packages (resolved from default PyPI), including FlashAttention
# Note: We intentionally do not use a Pipfile/lock here to avoid mixing resolvers with the base (uv lock),
# to control CUDA/FA install order and indexes, and to reduce lock churn across arches/ABI-specific wheels.
RUN pip install --retries 5 --timeout 300 --no-cache-dir \
```
Possibly in a separate PR, should we start adding kubeflow-sdk and kfp?
kubeflow-sdk is here - again, this is temp, needs to be updated.
I'll add kfp, the imminent 2.15 release has some functionality in it that we will want, so might need to wait for that.
Actionable comments posted: 1
🧹 Nitpick comments (1)
images/universal/training/py312-cuda128-torch280/Dockerfile (1)
157-160: Document rationale for the deterministic two-step install approach. Lines 157–160 implement a specific install sequence for `causal-conv1d` and `mamba-ssm` with the `--no-build-isolation` and `--no-deps` flags. While this pattern may align with runtime behavior, add a comment explaining:
- why `causal-conv1d` must be installed first
- why `mamba-ssm` requires `--no-deps`
- whether this order prevents dependency conflicts or resolves specific build issues

```diff
-# Deterministic 2-step: sub-dep first, then parent without deps (align with runtime)
+# Deterministic 2-step: sub-dep first, then parent without deps (align with runtime)
+# causal-conv1d must be installed before mamba-ssm to ensure it is available during the mamba-ssm build.
+# mamba-ssm uses --no-deps to prevent pip from installing a conflicting version of causal-conv1d.
 RUN pip install --retries 5 --timeout 300 --no-cache-dir --no-build-isolation causal-conv1d==1.5.3.post1 && \
     pip install --retries 5 --timeout 300 --no-cache-dir --no-build-isolation mamba-ssm==2.2.6.post3 --no-deps && \
     fix-permissions /opt/app-root -P
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
images/universal/training/py312-cuda128-torch280/Dockerfile (3 hunks)
🔇 Additional comments (2)
images/universal/training/py312-cuda128-torch280/Dockerfile (2)
163-168: Entrypoint script verified and correctly configured. The script exists at the specified path, contains a valid shebang, and will be properly executable in the container via the `--chmod=0755` directive in the COPY command. The container will start correctly.
22-22: No issues found after verification. The required files (LICENSE.md and entrypoint-universal.sh) exist in the repository. Package versions are available on PyPI: training-hub 0.3.0 is published and compatible, bitsandbytes 0.48.1 includes CUDA 12.8 compatible wheels, and all dependency versions are properly specified.
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: astefanutti. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Thanks!
New changes are detected. LGTM label has been removed.
Force-pushed from acaff12 to e24db03 (compare)
Actionable comments posted: 1
♻️ Duplicate comments (1)
images/universal/training/py312-cuda128-torch280/Dockerfile (1)
137-137: Upgrade bitsandbytes to 0.48.2; version 0.48.1 lacks CUDA 12.8 wheels. bitsandbytes 0.48.1 has no CUDA 12.8 wheels available on PyPI (0 cu128 wheels found). This will cause runtime failures when the container attempts to load `libbitsandbytes_cuda128`. Upgrade to bitsandbytes 0.48.2, which has confirmed cu128 support.

```diff
-    bitsandbytes==0.48.1 \
+    bitsandbytes==0.48.2 \
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
images/universal/training/py312-cuda128-torch280/Dockerfile (3 hunks)
🔇 Additional comments (2)
images/universal/training/py312-cuda128-torch280/Dockerfile (2)
163-163: No issues found. All required files exist and are correctly configured. Both `LICENSE.md` and `entrypoint-universal.sh` exist in the repository. The `entrypoint-universal.sh` script correctly implements the intended behavior:
- Default mode (no args): runs `start-notebook.sh` via the entrypoint wrapper
- Headless mode (with command): the wrapper executes the provided command
The ENTRYPOINT and CMD configuration at lines 167-168 properly support this dual-mode behavior.
139-140: I need to check what distribution files are actually available for each package, since the absence of cp312 wheels doesn't necessarily mean they won't work. The web search results reveal important version discrepancies and accessibility issues. Let me verify the kernels package version specifically:
All new packages are compatible with Python 3.12; review concern is not valid.
einops 0.8.1 has a universal wheel (einops-0.8.1-py3-none-any.whl), and kernels version 0.10.3 exists on PyPI (released Oct 13, 2025) with Python >=3.9 requirements. Both packages use universal wheels (py3-none-any format) which are compatible with Python 3.12—specific cp312 wheels are unnecessary for pure Python packages. instructlab-training supports Python 3.12. The script's finding of "0 Python 3.12 wheels" reflects the absence of version-specific binary wheels, not actual incompatibility; universal wheels work across all Python 3.x versions.
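The wheel-naming rule this conclusion relies on can be checked mechanically: wheel filenames follow the `name-version-pythontag-abitag-platform.whl` convention (ignoring optional build tags), so the Python tag is always the third field from the end. A small shell sketch; the helper names here are made up for illustration:

```shell
# Extract the Python tag from a wheel filename
# (format: name-version-pythontag-abitag-platform.whl; names use underscores,
# so splitting on "-" is safe for the common case without a build tag)
wheel_python_tag() {
  basename "$1" .whl | awk -F- '{print $(NF-2)}'
}

# A wheel works on CPython 3.12 if it is universal (py3) or built for cp312
is_cp312_compatible() {
  tag=$(wheel_python_tag "$1")
  [ "$tag" = "py3" ] || [ "$tag" = "cp312" ]
}

wheel_python_tag "einops-0.8.1-py3-none-any.whl"   # -> py3
```

Applied to the packages above, `einops-0.8.1-py3-none-any.whl` classifies as universal, which is why the scanner's "0 Python 3.12 wheels" result is a false alarm for pure-Python packages.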
```dockerfile
# Deterministic 2-step: sub-dep first, then parent without deps (align with runtime)
RUN pip install --retries 5 --timeout 300 --no-cache-dir --no-build-isolation causal-conv1d==1.5.3.post1 && \
    pip install --retries 5 --timeout 300 --no-cache-dir --no-build-isolation mamba-ssm==2.2.6.post3 --no-deps && \
    fix-permissions /opt/app-root -P
```
🧩 Analysis chain
Clarify the rationale and verify completeness of the 2-step deterministic installation.
The 2-step installation of causal-conv1d followed by mamba-ssm with --no-deps assumes that causal-conv1d satisfies all of mamba-ssm's dependencies. While this pattern can improve determinism, it bypasses pip's dependency resolution, which carries risk if mamba-ssm has other undeclared dependencies.
Verify that causal-conv1d 1.5.3.post1 is the only missing dependency of mamba-ssm 2.2.6.post3:
Also document in a comment why this explicit 2-step approach is necessary (e.g., "avoids pip resolver conflicts with training-hub" or similar).
🏁 Script executed:

```bash
#!/bin/bash
# Check mamba-ssm dependency metadata
echo "=== Checking mamba-ssm 2.2.6.post3 metadata ==="
curl -s "https://pypi.org/pypi/mamba-ssm/2.2.6.post3/json" | \
python3 -c "
import sys, json
data = json.load(sys.stdin)
info = data.get('info', {})
print('Requires:')
requires = info.get('requires_dist', [])
if requires:
    for req in requires:
        print(f'  - {req}')
else:
    print('  (no metadata available)')
"

# Also verify causal-conv1d version
echo ""
echo "=== Checking causal-conv1d 1.5.3.post1 availability ==="
curl -s "https://pypi.org/pypi/causal-conv1d/1.5.3.post1/json" | \
python3 -c "
import sys, json
try:
    data = json.load(sys.stdin)
    print('✓ causal-conv1d 1.5.3.post1 exists')
except:
    print('✗ causal-conv1d 1.5.3.post1 not found')
"
```

Length of output: 175
Remove the --no-deps flag and declare causal-conv1d explicitly in project requirements instead.
causal-conv1d is listed as an optional component for mamba-ssm, not a required dependency. Using --no-deps bypasses pip's dependency validation, which prevents verification that the installed versions are compatible. If causal-conv1d is not installed, mamba-ssm falls back to an eager implementation that is slower but functional.
Replace the 2-step pattern with a straightforward install:
```dockerfile
RUN pip install --retries 5 --timeout 300 --no-cache-dir --no-build-isolation causal-conv1d==1.5.3.post1 mamba-ssm==2.2.6.post3
```
Or declare causal-conv1d==1.5.3.post1 in your training-hub requirements file so pip resolves both transitively with proper version constraints.
🤖 Prompt for AI Agents
In images/universal/training/py312-cuda128-torch280/Dockerfile around lines 157
to 160, the RUN uses pip install mamba-ssm with --no-deps which bypasses
dependency resolution and prevents pip verifying compatibility with optional
causal-conv1d; remove the --no-deps and either install both packages together in
one pip invocation (e.g., pip install causal-conv1d==1.5.3.post1
mamba-ssm==2.2.6.post3) or add causal-conv1d==1.5.3.post1 to the project
training-hub requirements so pip can resolve versions transitively and ensure
compatibility.
/retest
Force-pushed from da73149 to 9793396 (compare)
Actionable comments posted: 2
🧹 Nitpick comments (1)
images/universal/training/py312-cuda128-torch280/Dockerfile (1)
73-76: Avoid referencing environment variables within the same `ENV` statement (Hadolint DL3044). While Docker expands variables in this context, it is clearer and follows Dockerfile best practices to split the definition into separate statements.
Apply this diff to separate the ENV statements:
```diff
-ENV CUDA_HOME=/usr/local/cuda \
-    PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH} \
-    LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:$CUDA_HOME/lib64:$CUDA_HOME/extras/CUPTI/lib64:$LD_LIBRARY_PATH \
-    TORCH_CUDA_ARCH_LIST="8.0;8.6;8.9;9.0"
+ENV CUDA_HOME=/usr/local/cuda
+
+ENV PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH} \
+    LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:${CUDA_HOME}/lib64:${CUDA_HOME}/extras/CUPTI/lib64:${LD_LIBRARY_PATH} \
+    TORCH_CUDA_ARCH_LIST="8.0;8.6;8.9;9.0"
```
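As a quick illustration of why DL3044 exists: Docker expands all variables in a single `ENV` instruction against the environment as it was *before* that instruction, so a variable defined and referenced on the same line expands to its old (possibly empty) value. The shell analogue below, using `env` with a function, is a stand-in demonstration of that one-statement expansion order, not Docker itself:

```shell
# Shell analogue of DL3044: both assignments in one env invocation are
# expanded against the *old* environment, so CUDA_HOME is not yet visible.
demo_same_statement() {
  unset CUDA_HOME
  # ${CUDA_HOME:-unset} expands before the new CUDA_HOME value exists
  env CUDA_HOME=/usr/local/cuda DEMO="${CUDA_HOME:-unset}/bin" sh -c 'echo "$DEMO"'
}
demo_same_statement   # -> unset/bin, not /usr/local/cuda/bin
```

Splitting `CUDA_HOME` into its own `ENV` statement, as the diff above does, guarantees the later `PATH` and `LD_LIBRARY_PATH` references see the intended value.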
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- .tekton/universal-image-py312-cuda128-torch280-pull-request.yaml (1 hunks)
- .tekton/universal-image-py312-cuda128-torch280-push.yaml (1 hunks)
- images/universal/training/py312-cuda128-torch280/Dockerfile (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- .tekton/universal-image-py312-cuda128-torch280-push.yaml
- .tekton/universal-image-py312-cuda128-torch280-pull-request.yaml
🧰 Additional context used
🪛 Hadolint (2.14.0)
images/universal/training/py312-cuda128-torch280/Dockerfile
[error] 73-73: Do not refer to an environment variable within the same ENV statement where it is defined.
(DL3044)
[error] 124-124: Multiple redirections compete for stdout. Use cat, tee, or pass filenames instead.
(SC2261)
🔇 Additional comments (2)
images/universal/training/py312-cuda128-torch280/Dockerfile (2)
196-201: LGTM: Entrypoint wrapper and dual-mode behavior. The addition of the entrypoint wrapper and dual-mode setup (workbench by default, headless when args provided) aligns well with the design intent documented in the header. This allows the image to work both as an interactive Jupyter environment and as a runtime container for training jobs.
185-193: Verify whether the `--no-deps` use on line 189 is an intentional design trade-off or an unaddressed issue. The review comment accurately identifies that `--no-deps` is present on line 189 and correctly explains the technical concern: it bypasses pip's dependency validation. However, the Dockerfile comments (lines 183–184) indicate a deliberate "deterministic 2-step build" design, which may intentionally use `--no-deps` to prevent re-resolution during the wheel build phase. Before accepting the recommendation to remove `--no-deps`, confirm:
- whether this issue was actually flagged and deferred in a prior review
- whether the `--no-deps` approach is a known workaround documented in your project's build design
- whether removing `--no-deps` aligns with the stated goal of deterministic builds

If `--no-deps` was an unresolved concern from a prior review and the team decided it should be removed, the review comment's recommendation is sound. If it's an intentional design decision, the concern may need to be reconsidered differently (e.g., adding explicit dependency validation in CI).
```dockerfile
    && echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf \
    && echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf \
    && dnf clean all \
    && rm -rf /var/cache/dnf/*
```
Add ldconfig call after updating /etc/ld.so.conf.d/nvidia.conf.
Writing to ldconfig config files does not take effect until ldconfig is run to rebuild the cache. Without this step, the CUDA libraries at /usr/local/nvidia/lib and /usr/local/nvidia/lib64 will not be discoverable during the preflight check on line 82 or at runtime.
Apply this diff to add the ldconfig cache update:

```diff
 && echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf \
 && echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf \
+&& ldconfig \
 && dnf clean all \
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```dockerfile
    && echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf \
    && echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf \
    && ldconfig \
    && dnf clean all \
    && rm -rf /var/cache/dnf/*
```
🤖 Prompt for AI Agents
In images/universal/training/py312-cuda128-torch280/Dockerfile around lines
67-70, after appending /usr/local/nvidia/lib and /usr/local/nvidia/lib64 to
/etc/ld.so.conf.d/nvidia.conf you must run ldconfig to rebuild the dynamic
linker cache so the CUDA libraries become discoverable; add a call to ldconfig
immediately after the echo lines and before the dnf clean all / rm -rf
/var/cache/dnf/* steps.
Force-pushed from 9793396 to ad676de (compare)
Actionable comments posted: 1
♻️ Duplicate comments (3)
images/universal/training/py312-cuda128-torch280/Dockerfile (3)
67-70: Missing `ldconfig` call; unresolved critical issue from prior review. Lines 67–68 append CUDA library paths to `/etc/ld.so.conf.d/nvidia.conf`, but the dynamic linker cache is not rebuilt. The `ldconfig -p` check on line 82 comes too late and does not rebuild the cache. Without an explicit `ldconfig` call immediately after writing the config, CUDA libraries at `/usr/local/nvidia/lib` and `/usr/local/nvidia/lib64` remain undiscoverable during preflight checks and at runtime. Apply this diff to rebuild the cache immediately:

```diff
 && echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf \
 && echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf \
+&& ldconfig \
 && dnf clean all \
```
142-142: Upgrade bitsandbytes to 0.48.2 or later; unresolved critical issue from prior review. Line 142 pins bitsandbytes to 0.48.1, which lacks official CUDA 12.8 wheels. Versions 0.48.2 and later include published wheels with proper CUDA 12.8 support. Without this upgrade, the image will fail at runtime when bitsandbytes attempts to load `libbitsandbytes_cuda128`. Apply this diff:

```diff
-    bitsandbytes==0.48.1 \
+    bitsandbytes==0.48.2 \
```
183-193: Remove `--no-deps` to enable dependency validation; unresolved critical issue from prior review. Line 189 uses `--no-deps` when building mamba-ssm, which bypasses pip's dependency resolution and prevents verification that causal-conv1d and mamba-ssm versions are compatible. While the 2-step deterministic build with verbose logging is good for debugging, using `--no-deps` masks potential missing or incompatible dependencies. Apply this diff to allow pip to validate the dependency chain:

```diff
-    pip wheel --no-cache-dir --no-build-isolation -vv \
-        --log /tmp/pip-mamba-ssm.log \
-        mamba-ssm==2.2.6.post3 --no-deps -w /tmp/wheels || \
+    pip wheel --no-cache-dir --no-build-isolation -vv \
+        --log /tmp/pip-mamba-ssm.log \
+        causal-conv1d==1.5.3.post1 mamba-ssm==2.2.6.post3 -w /tmp/wheels || \
```

Then update line 191 accordingly:

```diff
-    pip install --no-cache-dir /tmp/wheels/*.whl && \
+    pip install --no-cache-dir /tmp/wheels/mamba*.whl && \
```

Alternatively, add `causal-conv1d==1.5.3.post1` to your training-hub requirements so pip resolves both transitively in a single install step.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- .tekton/universal-image-py312-cuda128-torch280-pull-request.yaml (1 hunks)
- .tekton/universal-image-py312-cuda128-torch280-push.yaml (1 hunks)
- images/universal/training/py312-cuda128-torch280/Dockerfile (4 hunks)
- images/universal/training/py312-cuda128-torch280/README.md (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- .tekton/universal-image-py312-cuda128-torch280-push.yaml
- .tekton/universal-image-py312-cuda128-torch280-pull-request.yaml
🧰 Additional context used
🪛 Hadolint (2.14.0)
images/universal/training/py312-cuda128-torch280/Dockerfile
[error] 73-73: Do not refer to an environment variable within the same ENV statement where it is defined.
(DL3044)
[error] 124-124: Multiple redirections compete for stdout. Use cat, tee, or pass filenames instead.
(SC2261)
🔇 Additional comments (5)
images/universal/training/py312-cuda128-torch280/README.md (1)
7-7: Documentation update is accurate. The addition of "Minimal Workbench" correctly reflects the base image used in the Dockerfile (line 14 of the Dockerfile).
images/universal/training/py312-cuda128-torch280/Dockerfile (4)
119-119: Build tooling upgrades are appropriate. The modern build tooling installation (pip, setuptools, wheel, ninja, cmake) ensures proper extension builds for the downstream packages.
162-181: CUDA user-space libraries via pip are well-structured. The dedicated pip block for NVIDIA CUDA libraries (lines 162–178) and the corresponding `LD_LIBRARY_PATH` update (line 181) provide proper isolation and reproducibility for extension builds that depend on cuDNN and other CUDA runtime libraries.
102-109: Torch installation and preflight checks are well-structured. Separating the PyTorch cu128 install on line 102 from the default PyPI index prevents index resolution conflicts. The inline CUDA validation on lines 105–109 provides useful fail-fast feedback.
195-201: Entrypoint and CMD design supports both workbench and headless modes. The entrypoint wrapper (line 196) with default CMD (line 201) allows the image to preserve workbench behavior by default while enabling headless runtime when a command is provided. This aligns well with the PR objective to support Training Hub 0.3.0 runtime scenarios.
Force-pushed from d71708d to 9b73c61 (compare)
Actionable comments posted: 0
♻️ Duplicate comments (4)
images/universal/training/py312-cuda128-torch280/Dockerfile (4)
67-70: ⚠️ [Unresolved] Missing `ldconfig` call after appending to `/etc/ld.so.conf.d/nvidia.conf`. Writing to ldconfig config files does not take effect until `ldconfig` is run to rebuild the cache. The current code appends to `/etc/ld.so.conf.d/nvidia.conf` but does not call `ldconfig`, so the CUDA libraries at `/usr/local/nvidia/lib` and `/usr/local/nvidia/lib64` will not be discoverable during the runtime checks on line 82 or at runtime. This was flagged in a prior review and remains unresolved. Apply this diff to add the missing `ldconfig` call:

```diff
 && echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf \
 && echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf \
+&& ldconfig \
 && dnf clean all \
```
73-76: ⚠️ [Unresolved] ENV statement violates Hadolint DL3044 with self-referencing variables. Lines 74–75 reference `${PATH}` and `${LD_LIBRARY_PATH}` within the same `ENV` statement where they are being defined. This violates Docker best practices (Hadolint DL3044) and may cause unexpected behavior. This issue was flagged in a prior review and remains unresolved. Refactor into separate `ENV` statements to allow prepending to existing values:

```diff
-ENV CUDA_HOME=/usr/local/cuda \
-    PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH} \
-    LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:$CUDA_HOME/lib64:$CUDA_HOME/extras/CUPTI/lib64:$LD_LIBRARY_PATH \
-    TORCH_CUDA_ARCH_LIST="8.0;8.6;8.9;9.0"
+ENV CUDA_HOME=/usr/local/cuda \
+    TORCH_CUDA_ARCH_LIST="8.0;8.6;8.9;9.0"
+ENV PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH}
+ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:${CUDA_HOME}/lib64:${CUDA_HOME}/extras/CUPTI/lib64:${LD_LIBRARY_PATH}
```
150-150: ⚠️ [Unresolved] Upgrade bitsandbytes from 0.48.1 to 0.48.2 or later for CUDA 12.8 wheel support. Line 150 pins `bitsandbytes==0.48.1`, which lacks official CUDA 12.8 wheels. Bitsandbytes 0.48.2 and later include published wheels targeting CUDA 12.8, resolving runtime failures when loading `libbitsandbytes_cuda128`. This critical issue was flagged in a prior review and remains unresolved. Apply this diff to upgrade the package:

```diff
-    bitsandbytes==0.48.1 \
+    bitsandbytes==0.48.2 \
```
191-201: ⚠️ [Unresolved] mamba-ssm build still uses `--no-deps`, bypassing dependency validation. Line 197 continues to use `--no-deps` when building the mamba-ssm wheel, which bypasses pip's dependency resolution and prevents verification that optional dependencies (like causal-conv1d) are compatible with the installed version. This critical issue was flagged in a prior review and remains unresolved. While the 2-step deterministic build pattern (build wheel, then install from local wheel) is a reasonable approach for reproducibility and debugging, the `--no-deps` flag introduces risk. Recommended fix: install both packages together in a single pip invocation to let pip resolve compatibility:

```diff
-RUN python -m pip install --no-cache-dir --no-build-isolation -vv --log /tmp/pip-causal.log causal-conv1d==1.5.3.post1 && \
-    CMAKE_GENERATOR=Ninja CMAKE_ARGS="-DCMAKE_VERBOSE_MAKEFILE=ON" \
-    python -m pip wheel --no-cache-dir --no-build-isolation -vv \
-        --log /tmp/pip-mamba-ssm.log \
-        mamba-ssm==2.2.6.post3 --no-deps -w /tmp/wheels || \
-    (echo '--- pip mamba-ssm log (tail) ---'; tail -n 500 /tmp/pip-mamba-ssm.log; exit 1) && \
-    python -m pip install --no-cache-dir /tmp/wheels/*.whl && \
-    rm -rf /tmp/wheels && \
-    fix-permissions /opt/app-root -P
+RUN CMAKE_GENERATOR=Ninja CMAKE_ARGS="-DCMAKE_VERBOSE_MAKEFILE=ON" \
+    python -m pip install --retries 5 --timeout 300 --no-cache-dir --no-build-isolation \
+        causal-conv1d==1.5.3.post1 mamba-ssm==2.2.6.post3 && \
+    fix-permissions /opt/app-root -P
```

Alternatively, if deterministic wheel builds are essential, remove `--no-deps` to allow pip to validate both packages together before installation.
🧹 Nitpick comments (1)
images/universal/training/py312-cuda128-torch280/Dockerfile (1)
108-110: Consider removing or conditionally gating diagnostic print statements. Lines 108–110 print Python interpreter details and torch metadata during the build. These lines are useful for build-time debugging but are unusual in a production Dockerfile. Consider either removing them, moving them into a separate build stage, or gating them behind a build ARG for optional diagnostic mode.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- .tekton/universal-image-py312-cuda128-torch280-pull-request.yaml (1 hunks)
- .tekton/universal-image-py312-cuda128-torch280-push.yaml (1 hunks)
- images/universal/training/py312-cuda128-torch280/Dockerfile (6 hunks)
- images/universal/training/py312-cuda128-torch280/README.md (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- images/universal/training/py312-cuda128-torch280/README.md
- .tekton/universal-image-py312-cuda128-torch280-pull-request.yaml
- .tekton/universal-image-py312-cuda128-torch280-push.yaml
🧰 Additional context used
🪛 Hadolint (2.14.0)
images/universal/training/py312-cuda128-torch280/Dockerfile
[error] 73-73: Do not refer to an environment variable within the same ENV statement where it is defined.
(DL3044)
[error] 132-132: Multiple redirections compete for stdout. Use cat, tee, or pass filenames instead.
(SC2261)
🔇 Additional comments (5)
images/universal/training/py312-cuda128-torch280/Dockerfile (5)
95-96: Python user-site configuration is well configured. Setting `PYTHONUSERBASE` and `PYTHONNOUSERSITE` provides clear control over user package installs and discovery paths, which aligns with the containerized environment goals.
106-106: Good adoption of `python -m pip` for improved determinism and consistency. The conversion from `pip` to `python -m pip` throughout the file is a positive change that improves module import consistency and reduces ambiguity about which Python is being used. This is a best practice for container builds. Also applies to: 120-120, 123-124, 127-127.
171-186: NVIDIA CUDA user-space libraries via pip are well selected and versioned. The addition of NVIDIA user-space CUDA libraries (nccl, cublas, cudnn, cufft, curand, cusolver, cusparse, nvjitlink, nvtx, etc.) provides the necessary runtime components for compiled extensions. Version pinning ensures reproducibility.
189-189: cuDNN path addition to LD_LIBRARY_PATH is appropriate. Adding `/opt/app-root/lib/python3.12/site-packages/nvidia/cudnn/lib` to `LD_LIBRARY_PATH` ensures that cuDNN libraries installed via pip are discoverable during source builds. This enables consistent behavior between pip-installed and system-installed CUDA libraries.
204-209: ENTRYPOINT and CMD additions enable flexible invocation. The addition of a POSIX entrypoint wrapper (`entrypoint-universal.sh`) and explicit `ENTRYPOINT`/`CMD` declarations provide clean separation between workbench mode (default, runs start-notebook.sh) and headless mode (when a command is provided). This aligns well with the design intent documented in the file header.
Force-pushed from ff5f03b to da81998 (compare)
Actionable comments posted: 0
♻️ Duplicate comments (3)
images/universal/training/py312-cuda128-torch280/Dockerfile (3)
150-150: Resolve unaddressed critical issue: upgrade bitsandbytes to 0.48.2+ for CUDA 12.8 support. Line 150 pins bitsandbytes to 0.48.1, which was flagged as a critical issue in prior reviews and remains unresolved. Version 0.48.1 lacks official CUDA 12.8 wheels; the runtime will fail when trying to load `libbitsandbytes_cuda128`. A prior review confirmed via web search that bitsandbytes 0.48.2 and later include published wheels targeting CUDA 12.8. Apply this fix:

```diff
- bitsandbytes==0.48.1 \
+ bitsandbytes==0.48.2 \
```
67-70: Add `ldconfig` call to rebuild dynamic linker cache after updating `ld.so.conf.d`. Lines 67–68 append NVIDIA library paths to `/etc/ld.so.conf.d/nvidia.conf`, but the dynamic linker cache is not rebuilt. This was flagged as a major issue in prior reviews. Without calling `ldconfig`, the CUDA libraries at `/usr/local/nvidia/lib` and `/usr/local/nvidia/lib64` remain undiscoverable, and the preflight check at line 82 may fail. Apply this diff:

```diff
  && echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf \
  && echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf \
+ && ldconfig \
  && dnf clean all \
```
73-76: Split ENV statement to avoid self-referencing variables (Hadolint DL3044). Lines 73–76 define `CUDA_HOME`, `PATH`, `LD_LIBRARY_PATH`, and `TORCH_CUDA_ARCH_LIST` in a single `ENV` block, with `PATH` and `LD_LIBRARY_PATH` self-referencing their own values. This violates Hadolint DL3044 and may behave unexpectedly. This was flagged as a major issue in prior reviews and remains unresolved. Refactor into separate `ENV` statements to allow safe prepending:

```diff
- ENV CUDA_HOME=/usr/local/cuda \
-     PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH} \
-     LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:$CUDA_HOME/lib64:$CUDA_HOME/extras/CUPTI/lib64:$LD_LIBRARY_PATH \
-     TORCH_CUDA_ARCH_LIST="8.0;8.6;8.9;9.0"
+ ENV CUDA_HOME=/usr/local/cuda
+ ENV PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH}
+ ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:${CUDA_HOME}/lib64:${CUDA_HOME}/extras/CUPTI/lib64:${LD_LIBRARY_PATH}
+ ENV TORCH_CUDA_ARCH_LIST="8.0;8.6;8.9;9.0"
```
🧹 Nitpick comments (1)
images/universal/training/py312-cuda128-torch280/Dockerfile (1)
189-199: Consider removing or better-documenting the large commented-out code block.Lines 189–199 contain a commented-out 2-step deterministic build for causal-conv1d and mamba-ssm with verbose logging. If these packages are not planned for immediate use, remove this block to keep the Dockerfile clean. If these are deferred features, add a clearer comment explaining the reason for deferral (e.g., "# TODO: Enable after resolving mamba-ssm --no-deps issue" or "# DEBUG: Disabled pending package availability").
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- `.tekton/universal-image-py312-cuda128-torch280-pull-request.yaml` (1 hunks)
- `.tekton/universal-image-py312-cuda128-torch280-push.yaml` (1 hunks)
- `images/universal/training/py312-cuda128-torch280/Dockerfile` (6 hunks)
- `images/universal/training/py312-cuda128-torch280/README.md` (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- images/universal/training/py312-cuda128-torch280/README.md
🚧 Files skipped from review as they are similar to previous changes (2)
- .tekton/universal-image-py312-cuda128-torch280-pull-request.yaml
- .tekton/universal-image-py312-cuda128-torch280-push.yaml
🧰 Additional context used
🪛 Hadolint (2.14.0)
images/universal/training/py312-cuda128-torch280/Dockerfile
[error] 73-73: Do not refer to an environment variable within the same ENV statement where it is defined.
(DL3044)
[error] 132-132: Multiple redirections compete for stdout. Use cat, tee, or pass filenames instead.
(SC2261)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Red Hat Konflux / universal-image-py312-cuda128-torch280-on-pull-request
🔇 Additional comments (2)
images/universal/training/py312-cuda128-torch280/Dockerfile (2)
95-96: Good additions: modern pip usage, diagnostic support, explicit NVIDIA wheels, and proper entrypoint setup. The following improvements are well-executed:

- Lines 95–96: Python user-site environment variables properly configured.
- Lines 106, 120, 123–124, 127, 132: switched to `python -m pip` (modern best practice).
- Lines 108–110: added diagnostics for Python environment and torch availability.
- Lines 168–184: explicit installation of NVIDIA CUDA user-space libraries via pip ensures extension builds have necessary headers and libraries.
- Line 187: LD_LIBRARY_PATH extended to include cuDNN from pip, aligning with pip-installed wheels.
- Lines 202–207: ENTRYPOINT and CMD properly set, maintaining design intent.
Also applies to: 106-106, 120-120, 123-124, 127-127, 132-132, 168-184, 187-187, 202-207
202-202: No action required: `entrypoint-universal.sh` exists and is valid. The file at `./images/universal/training/py312-cuda128-torch280/entrypoint-universal.sh` exists in the repository and contains valid shell script content. The COPY command on line 202 will execute successfully.
Actionable comments posted: 0
♻️ Duplicate comments (4)
images/universal/training/py312-cuda128-torch280/Dockerfile (4)
177-177: Critical: Update bitsandbytes to version 0.48.2 or later (duplicate fix required). This is the second occurrence of `bitsandbytes==0.48.1` that must be updated to match the fail-fast check. See the preceding comment for full context. Apply this diff:

```diff
- bitsandbytes==0.48.1 \
+ bitsandbytes==0.48.2 \
```
151-152: Critical: Update bitsandbytes to version 0.48.2 or later for CUDA 12.8 support. bitsandbytes 0.48.1 publishes only generic manylinux wheels without CUDA 12.8-specific wheels on PyPI. Since this Dockerfile targets CUDA 12.8, the fail-fast check here will fail during installation, and even if it passes, runtime will fail when attempting to load `libbitsandbytes_cuda128`. This issue was previously flagged as critical and marked as resolved, but the version remains unchanged. Apply this diff:

```diff
- python -m pip download --retries 5 --timeout 120 --only-binary=:all: --no-deps -d /tmp/wheels \
-     numba==0.62.1 bitsandbytes==0.48.1 || \
+ python -m pip download --retries 5 --timeout 120 --only-binary=:all: --no-deps -d /tmp/wheels \
+     numba==0.62.1 bitsandbytes==0.48.2 || \
```
68-71: Add `ldconfig` call after updating `/etc/ld.so.conf.d/nvidia.conf`. Writing to ldconfig config files does not take effect until `ldconfig` is run to rebuild the cache. Without this step, the CUDA libraries at `/usr/local/nvidia/lib` and `/usr/local/nvidia/lib64` will not be discoverable during the preflight checks on line 88 or at runtime. Apply this diff to add the ldconfig call:

```diff
  && echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf \
  && echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf \
+ && ldconfig \
  && dnf clean all \
```
73-76: Fix environment variable self-reference (Hadolint DL3044). Within a single `ENV` statement, you cannot reference a variable being defined (e.g., `${PATH}` and `${LD_LIBRARY_PATH}` on lines 75-76). This violates Docker best practices. Refactor into separate `ENV` statements to allow prepending to existing values. Apply this diff to split the ENV block:

```diff
- ENV CUDA_HOME=/usr/local/cuda \
-     PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH} \
-     LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:$CUDA_HOME/lib64:$CUDA_HOME/extras/CUPTI/lib64:$LD_LIBRARY_PATH
- # TODO: revisit this, going with default list for now.
- # TORCH_CUDA_ARCH_LIST="8.0;8.6;8.9;9.0"
+ ENV CUDA_HOME=/usr/local/cuda \
+     TORCH_CUDA_ARCH_LIST="8.0;8.6;8.9;9.0"
+ ENV PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH}
+ ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:${CUDA_HOME}/lib64:${CUDA_HOME}/extras/CUPTI/lib64:${LD_LIBRARY_PATH}
```

Also uncomment `TORCH_CUDA_ARCH_LIST` (currently commented on line 78) to enable explicit CUDA architecture targeting for builds.
🧹 Nitpick comments (1)
images/universal/training/py312-cuda128-torch280/Dockerfile (1)
228-264: Clarify rationale for the mamba-ssm `--no-deps` pattern. The 2-step installation of `causal-conv1d` followed by `mamba-ssm` with `--no-deps` (line 241) bypasses pip's dependency resolution. While this can improve determinism and control build order, it prevents pip from verifying compatibility with optional components. If `causal-conv1d` is missing, `mamba-ssm` falls back to a slower eager implementation rather than failing fast. Consider documenting in a comment why this explicit 2-step approach is necessary (e.g., "avoids pip resolver conflicts with training-hub" or similar), or alternatively, declare `causal-conv1d==1.5.3.post1` explicitly in the training-hub requirements so pip can resolve both transitively with proper version constraints.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- `.tekton/universal-image-py312-cuda128-torch280-pull-request.yaml` (1 hunks)
- `.tekton/universal-image-py312-cuda128-torch280-push.yaml` (1 hunks)
- `images/universal/training/py312-cuda128-torch280/Dockerfile` (6 hunks)
- `images/universal/training/py312-cuda128-torch280/README.md` (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- .tekton/universal-image-py312-cuda128-torch280-pull-request.yaml
🚧 Files skipped from review as they are similar to previous changes (2)
- .tekton/universal-image-py312-cuda128-torch280-push.yaml
- images/universal/training/py312-cuda128-torch280/README.md
🧰 Additional context used
🪛 Hadolint (2.14.0)
images/universal/training/py312-cuda128-torch280/Dockerfile
[error] 234-234: unexpected 's'
expecting '#', '', ADD, ARG, CMD, COPY, ENTRYPOINT, ENV, EXPOSE, FROM, HEALTHCHECK, LABEL, MAINTAINER, ONBUILD, RUN, SHELL, STOPSIGNAL, USER, VOLUME, WORKDIR, a pragma, at least one space, or end of input
(DL1000)
🔇 Additional comments (3)
images/universal/training/py312-cuda128-torch280/Dockerfile (3)
22-22: License copy and pip configuration look good. The LICENSE.md file has been verified to exist in the repository, and the new pip environment variables (PIP_DISABLE_PIP_VERSION_CHECK, PIP_UPGRADE_STRATEGY) are appropriate additions for deterministic builds.

Also applies to: 33-34
232-260: Address Hadolint DL1000 syntax error in mamba-ssm heredoc. Static analysis reports an "unexpected 's'" error at line 234, suggesting potential issues with the complex quoting in the heredoc structure. While the syntax may be valid, the nested quoting pattern `'cat > file <<'\''BASH'\''` is difficult to parse and maintain. Consider refactoring for clarity. One approach is to avoid the complex quote escaping by using a different pattern:

```diff
- bash -lc 'cat > /tmp/build_mamba_wheel.sh <<'\''BASH'\''
+ bash -c 'cat > /tmp/build_mamba_wheel.sh << "BASH"'
```

This eliminates the escaped-quote pattern. Verify that the refactored version produces the same heredoc behavior and that Hadolint validation passes.
205-224: NVIDIA CUDA packages and entrypoint setup look good. The comprehensive set of NVIDIA CUDA user-space packages (lines 205-221) are properly versioned and include cuDNN, NCCL, and build libraries. The LD_LIBRARY_PATH update on line 224 correctly points to pip-installed cuDNN. The entrypoint wrapper pattern (lines 266-272) provides sensible flexibility between workbench (default) and headless runtime modes.

Also applies to: 266-272
/retest
ad29950 to 7553cf9 Compare
/retest
7553cf9 to 8927298 Compare
Actionable comments posted: 0
♻️ Duplicate comments (2)
images/universal/training/py312-cuda128-torch280/Dockerfile (2)
68-71: Add `ldconfig` call to make /etc/ld.so.conf.d/nvidia.conf paths discoverable. Add a file with the .conf extension to /etc/ld.so.conf.d/ that contains the paths to the libraries, then run ldconfig. Modifying `/etc/ld.so.conf.d/nvidia.conf` alone does not take effect until `ldconfig` is invoked to rebuild the dynamic linker cache. The preflight check at line 88 (`ldconfig -p | grep`) will fail if ldconfig is not run after the echo commands.

```diff
  && echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf \
  && echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf \
+ && ldconfig \
  && dnf clean all \
```
74-76: Split ENV statements to resolve self-reference violation (Hadolint DL3044). Docker will not expand a variable within the same ENV statement where it is defined. While it will not crash Docker, it carries a high likelihood of errors. Lines 74-76 reference `${PATH}` and `${LD_LIBRARY_PATH}` in the same `ENV` statement where they are being prepended, violating Hadolint's DL3044 rule. Refactor into separate `ENV` statements:

```diff
- ENV CUDA_HOME=/usr/local/cuda \
-     PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH} \
-     LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:$CUDA_HOME/lib64:$CUDA_HOME/extras/CUPTI/lib64:$LD_LIBRARY_PATH
- # TODO: revisit this, going with default list for now.
- # TORCH_CUDA_ARCH_LIST="8.0;8.6;8.9;9.0"
+ ENV CUDA_HOME=/usr/local/cuda
+ ENV PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH}
+ ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:${CUDA_HOME}/lib64:${CUDA_HOME}/extras/CUPTI/lib64:${LD_LIBRARY_PATH}
```

Also note: the TODO comment about `TORCH_CUDA_ARCH_LIST` was removed; ensure this is intentional and the default Torch compile list is acceptable for your GPUs.
🧹 Nitpick comments (1)
images/universal/training/py312-cuda128-torch280/Dockerfile (1)
228-282: Simplify or document the mamba-ssm wheel build logic. Lines 228-282 implement a complex heredoc-based shell script embedded in a single RUN instruction to build the mamba-ssm wheel with timeout and heartbeat monitoring. While the intent (prevent hangs, log progress) is sound, the 50+ lines of inline shell logic make the Dockerfile harder to maintain. Consider:

- Extract to a separate build script in the repository (e.g., `build-mamba-ssm.sh`) and COPY it into the image before running it. This improves readability and allows unit testing of the script.
- Add a brief comment explaining why the 2-step causal-conv1d + mamba-ssm approach is necessary (e.g., "Mamba-SSM build takes >10 minutes; timeout ensures the process completes or fails safely without hanging the build").
- Verify the `--no-deps` flag on line 280 is intentional: are there any transitive dependencies of mamba-ssm beyond causal-conv1d that should be installed?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- `.tekton/universal-image-py312-cuda128-torch280-pull-request.yaml` (1 hunks)
- `.tekton/universal-image-py312-cuda128-torch280-push.yaml` (1 hunks)
- `images/universal/training/py312-cuda128-torch280/Dockerfile` (6 hunks)
- `images/universal/training/py312-cuda128-torch280/README.md` (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- .tekton/universal-image-py312-cuda128-torch280-pull-request.yaml
- .tekton/universal-image-py312-cuda128-torch280-push.yaml
- images/universal/training/py312-cuda128-torch280/README.md
🧰 Additional context used
🪛 Hadolint (2.14.0)
images/universal/training/py312-cuda128-torch280/Dockerfile
[error] 74-74: Do not refer to an environment variable within the same ENV statement where it is defined.
(DL3044)
[error] 157-157: Multiple redirections compete for stdout. Use cat, tee, or pass filenames instead.
(SC2261)
🔇 Additional comments (3)
images/universal/training/py312-cuda128-torch280/Dockerfile (3)
22-22: Verify LICENSE.md file exists in repository. Line 22 copies `./images/universal/training/py312-cuda128-torch280/LICENSE.md` into the image. Ensure this file exists in the repository, otherwise the Docker build will fail with a COPY error. If the CUDA license should not be bundled, remove this instruction.
285-285: Verify entrypoint-universal.sh file exists in repository. Line 285 copies `./images/universal/training/py312-cuda128-torch280/entrypoint-universal.sh` to the image. Ensure this script exists in the repository and is executable. If this file is missing, the Docker build will fail.
151-152: No version upgrade needed; bitsandbytes 0.48.1 already supports CUDA 12.8. bitsandbytes 0.48.1 and 0.48.2 both have prebuilt wheels covering CUDA 12.6 and 12.8, so the suggested upgrade from 0.48.1 to 0.48.2 is not justified. The claim that 0.48.1 "publishes only generic manylinux wheels without CUDA version tagging" is incorrect: the published wheels cover CUDA 12.8 through the manylinux wheels on PyPI and GitHub. The fail-fast check at line 151 will succeed as intended.
All supporting files (LICENSE.md and entrypoint-universal.sh) exist in the repository. The ldconfig call does occur in the build process (line 88), though not immediately after writing to /etc/ld.so.conf.d/nvidia.conf (lines 68–69), which is acceptable across separate RUN layers.
Likely an incorrect or invalid review comment.
8927298 to 02c9a56 Compare
Actionable comments posted: 0
♻️ Duplicate comments (3)
images/universal/training/py312-cuda128-torch280/Dockerfile (3)
151-151: Update bitsandbytes to 0.48.2 or later (critical, unresolved from prior reviews). Line 151 still pins `bitsandbytes==0.48.1`, which lacks official CUDA 12.8 wheels on PyPI. This was flagged as critical in prior reviews and will cause runtime failures when attempting to load `libbitsandbytes_cuda128`. Upgrade to 0.48.2 or later for cu128 support:

```diff
- numba==0.62.1 bitsandbytes==0.48.1 || \
+ numba==0.62.1 bitsandbytes==0.48.2 || \
```

Also update line 177:

```diff
- bitsandbytes==0.48.1 \
+ bitsandbytes==0.48.2 \
```
68-71: Missing `ldconfig` call after updating `/etc/ld.so.conf.d/nvidia.conf`. Writing to ldconfig config files requires rebuilding the dynamic linker cache for the paths to become discoverable. The preflight check on line 88 (`ldconfig -p | grep -E`) will fail without this step. Apply this diff to add the ldconfig call:

```diff
  && echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf \
  && echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf \
+ && ldconfig \
  && dnf clean all \
```
74-76: Fix Hadolint DL3044: split self-referencing ENV variables into separate statements. Referencing `${PATH}` and `${LD_LIBRARY_PATH}` within the same `ENV` block where they are being defined violates Docker best practices. Split these into separate `ENV` statements:

```diff
- ENV CUDA_HOME=/usr/local/cuda \
-     PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH} \
-     LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:$CUDA_HOME/lib64:$CUDA_HOME/extras/CUPTI/lib64:$LD_LIBRARY_PATH
- # TODO: revisit this, going with default list for now.
- # TORCH_CUDA_ARCH_LIST="8.0;8.6;8.9;9.0"
+ ENV CUDA_HOME=/usr/local/cuda
+ ENV PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH}
+ ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:${CUDA_HOME}/lib64:${CUDA_HOME}/extras/CUPTI/lib64:${LD_LIBRARY_PATH}
```
🧹 Nitpick comments (1)
images/universal/training/py312-cuda128-torch280/Dockerfile (1)
228-285: Consider removing `--no-deps` from mamba-ssm and installing both packages together. Line 243 uses `--no-deps` to skip dependency validation, relying on causal-conv1d being pre-installed on line 228. While this works because causal-conv1d is indeed installed first, it bypasses pip's dependency resolver, preventing verification of version compatibility. The complex shell wrapper adds valuable diagnostics and timeout handling, but the underlying pattern could be improved. Consider this alternative:

```diff
- RUN python -m pip install --no-cache-dir --no-build-isolation -vv --log /tmp/pip-causal.log causal-conv1d==1.5.3.post1 && \
+ RUN python -m pip install --no-cache-dir --no-build-isolation causal-conv1d==1.5.3.post1 && \
      [shell script with diagnostics/timeout] && \
-     python -m pip install --no-cache-dir /tmp/wheels/*.whl && \
+     python -m pip install --no-cache-dir /tmp/wheels/*.whl --no-deps && \
```

Or better yet: add `causal-conv1d==1.5.3.post1` to your training-hub requirements so pip resolves both transitively without `--no-deps`.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (2)
- `images/runtime/.DS_Store` is excluded by `!**/.DS_Store`
- `images/runtime/training/.DS_Store` is excluded by `!**/.DS_Store`
📒 Files selected for processing (4)
- `.tekton/universal-image-py312-cuda128-torch280-pull-request.yaml` (1 hunks)
- `.tekton/universal-image-py312-cuda128-torch280-push.yaml` (1 hunks)
- `images/universal/training/py312-cuda128-torch280/Dockerfile` (6 hunks)
- `images/universal/training/py312-cuda128-torch280/README.md` (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- .tekton/universal-image-py312-cuda128-torch280-push.yaml
- .tekton/universal-image-py312-cuda128-torch280-pull-request.yaml
- images/universal/training/py312-cuda128-torch280/README.md
🧰 Additional context used
🪛 Hadolint (2.14.0)
images/universal/training/py312-cuda128-torch280/Dockerfile
[error] 74-74: Do not refer to an environment variable within the same ENV statement where it is defined.
(DL3044)
[error] 157-157: Multiple redirections compete for stdout. Use cat, tee, or pass filenames instead.
(SC2261)
🔇 Additional comments (6)
images/universal/training/py312-cuda128-torch280/Dockerfile (6)
33-34: Good: PIP environment variables improve build determinism. The addition of `PIP_UPGRADE_STRATEGY=only-if-needed` (line 34) and other pip environment settings promote reproducibility and control over dependency resolution.
58-60: Good: CUDA toolchain, build environment, and user configuration. The additions of CUDA development packages (cuda-libraries-devel-12-8, cuda-compiler-12-8) on lines 58-60, build environment variables (USE_NINJA, TORCH_CUDA_VERBOSE_BUILD) on lines 81-83, and proper PYTHONUSERBASE configuration on lines 107-108 improve the build environment and extension compatibility.
Also applies to: 81-83, 107-108
96-99: Good: Preflight diagnostics improve debuggability. Adding diagnostic output for pip, compiler versions, and nvcc early in the build helps quickly identify environment issues before expensive package installations.
206-221: Good: NVIDIA CUDA pip packages and cuDNN path for extension builds. Installing NVIDIA CUDA pip packages (nvidia-cudnn-cu12, nvidia-cublas-cu12, etc.) on lines 206-221 and adding the cuDNN lib path on line 224 properly support source builds of extensions like flash-attn and mamba-ssm that expect these libraries to be discoverable.
Also applies to: 224-224
191-193: Good: Training hub 0.3.0 with documented version constraint handling. Line 191 correctly pins `training-hub==0.3.0`, supporting the PR objective. The comments on lines 138-143 and 197-200 pragmatically document the known upstream kubernetes/urllib3 version mismatch and defer `pip check` until the upstream constraint loosens. This is a reasonable approach given the constraint is external.

Also applies to: 138-143, 197-200
22-22: Files exist; no issues found. Both `LICENSE.md` and `entrypoint-universal.sh` are present at their expected paths. The COPY operations on lines 22 and 288 will succeed without errors.
57025bf to d62c537 Compare
Actionable comments posted: 0
♻️ Duplicate comments (3)
images/universal/training/py312-cuda128-torch280/Dockerfile (3)
68-71: Missing `ldconfig` call after updating `/etc/ld.so.conf.d/nvidia.conf`. Writing to ldconfig config files does not take effect until `ldconfig` is run to rebuild the cache. Lines 68–69 append NVIDIA library paths to the config, but the cache is never rebuilt. Without this step, the CUDA libraries at `/usr/local/nvidia/lib` and `/usr/local/nvidia/lib64` will not be discoverable at runtime or during the preflight check on line 88. Apply this diff to rebuild the ldconfig cache:

```diff
  && echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf \
  && echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf \
+ && ldconfig \
  && dnf clean all \
```
183-183: Upgrade bitsandbytes to 0.48.2 or later for CUDA 12.8 compatibility. Line 183 pins `bitsandbytes==0.48.1`, which was flagged as incompatible with CUDA 12.8 in prior reviews. Ensure consistency: if the fail-fast check on line 157 is updated to a newer version, this line must also be updated. Apply this diff:

```diff
- bitsandbytes==0.48.1 \
+ bitsandbytes==0.48.2 \
```
75-78: Fix Hadolint DL3044: ENV statement cannot self-reference its own variables. Lines 75–78 violate Docker best practices (and trigger Hadolint DL3044) by referencing `${PATH}` and `${LD_LIBRARY_PATH}` within the same `ENV` statement where they are being set. This can cause unexpected behavior depending on the Docker version. Refactor into separate `ENV` statements to allow prepending to existing values:

```diff
-ENV CUDA_HOME=/usr/local/cuda \
-    PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH} \
-    LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:$CUDA_HOME/lib64:$CUDA_HOME/extras/CUPTI/lib64:$LD_LIBRARY_PATH \
-    TORCH_CUDA_ARCH_LIST="${CUDA_ARCH_LIST}"
+ENV CUDA_HOME=/usr/local/cuda \
+    TORCH_CUDA_ARCH_LIST="${CUDA_ARCH_LIST}"
+ENV PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH}
+ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:${CUDA_HOME}/lib64:${CUDA_HOME}/extras/CUPTI/lib64:${LD_LIBRARY_PATH}
```
🧹 Nitpick comments (2)
images/universal/training/py312-cuda128-torch280/Dockerfile (2)
163-163: Simplify the find command to avoid Shellcheck SC2261 warning. Hadolint and Shellcheck flag line 163 for multiple redirections competing for stdout. This occurs within the fail-fast check when running `find /tmp -type f -name "*.o" | wc -l || true`. While the `|| true` fallback handles errors gracefully, a cleaner pattern would silence the output on failure:

```diff
- numba==0.62.1 bitsandbytes==0.48.1 || \
+ numba==0.62.1 bitsandbytes==0.48.2 || \
```

(Note: Once bitsandbytes is upgraded to 0.48.2 per the critical issue above, this refactor becomes moot.)
232-316: Mamba-SSM wheel build section: extensive logging is welcome, but document the timeout logic. Lines 232–316 implement an impressive deterministic 2-step wheel build with ccache support, detailed heartbeat logging, and a 20-minute timeout to prevent runaway compilation. The timeout and process-group cleanup logic are robust. However, the inline shell script (generated dynamically) is difficult to debug and maintain. Recommendations for a future refactor:

- Extract the mamba-ssm build script into a separate file in the Dockerfile directory (e.g., `build-mamba-ssm.sh`) and COPY it into the image, making it easier to version control and debug.
- Add a comment explaining why the 20-minute timeout is chosen (e.g., "Historical data shows mamba-ssm 2.2.6.post3 compiles in ~8–15 minutes on c5.4xlarge; 20 minutes provides 33% headroom").
- Document why `--no-deps` is used for mamba-ssm and whether causal-conv1d is a hard requirement or optional fallback.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (2)
- `images/runtime/.DS_Store` is excluded by `!**/.DS_Store`
- `images/runtime/training/.DS_Store` is excluded by `!**/.DS_Store`
📒 Files selected for processing (4)
- `.tekton/universal-image-py312-cuda128-torch280-pull-request.yaml` (1 hunks)
- `.tekton/universal-image-py312-cuda128-torch280-push.yaml` (1 hunks)
- `images/universal/training/py312-cuda128-torch280/Dockerfile` (5 hunks)
- `images/universal/training/py312-cuda128-torch280/README.md` (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- images/universal/training/py312-cuda128-torch280/README.md
🚧 Files skipped from review as they are similar to previous changes (1)
- .tekton/universal-image-py312-cuda128-torch280-pull-request.yaml
🧰 Additional context used
🪛 Hadolint (2.14.0)
images/universal/training/py312-cuda128-torch280/Dockerfile
[error] 75-75: Do not refer to an environment variable within the same ENV statement where it is defined.
(DL3044)
[error] 163-163: Multiple redirections compete for stdout. Use cat, tee, or pass filenames instead.
(SC2261)
🔇 Additional comments (5)
.tekton/universal-image-py312-cuda128-torch280-push.yaml (1)
20-21: Pipeline timeout extension aligns with expanded build scope. The 9-hour timeout complements the 8-hour timeout on the individual `build-container` task and accommodates the Dockerfile's extended diagnostics and multi-phase wheel builds.

images/universal/training/py312-cuda128-torch280/Dockerfile (4)
142-142: Clarify the branch name and confirm the SDK repository exists. Line 142 installs the SDK from a git URL referencing the `training-hub` branch: `git+https://github.com/briangallagher/sdk@training-hub`. Confirm that this branch exists in that repository and is intended for this PR (as opposed to using a released PyPI version of the SDK).
319-319: No issues found: entrypoint-universal.sh exists and COPY will succeed. The file is present at the specified path. Line 319's COPY command will execute without errors during the Docker build.
22-22: LICENSE.md file verified; no action required. The LICENSE.md file exists at the specified path. The COPY instruction will execute successfully during the Docker build.
157-158: No action required: bitsandbytes 0.48.1 includes official CUDA 12.8 wheels. The bitsandbytes v0.48.x series includes CUDA 12.8 wheels with support for Linux x86_64, aarch64, and Windows. The pinned version at line 157 is correct and will not cause the runtime failures previously flagged in prior reviews. The fail-fast mechanism will function as intended.
fde6736 to ba6e504 Compare
Actionable comments posted: 4
♻️ Duplicate comments (2)
images/universal/training/py312-cuda128-torch280/Dockerfile (2)
68-71: Add `ldconfig` call to rebuild dynamic linker cache after updating `/etc/ld.so.conf.d/`. Lines 68–71 append NVIDIA library paths to `/etc/ld.so.conf.d/nvidia.conf`, but the dynamic linker cache is not rebuilt. Without `ldconfig`, CUDA libraries will not be discoverable at line 88 (ldconfig grep verification) or at runtime. This was flagged in prior reviews. Apply this diff:

```diff
  && echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf \
  && echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf \
+ && ldconfig \
  && dnf clean all \
```
157-157: Critical regression: bitsandbytes 0.48.1 lacks CUDA 12.8 wheels; upgrade required. Lines 157 and 183 still pin `bitsandbytes==0.48.1`, which lacks CUDA 12.8 (cu128) wheel support. The fail-fast check at line 157 will fail during image build, and at runtime, the library fails to load `libbitsandbytes_cuda128`. Upgrade to `bitsandbytes==0.48.2` or later (verified cu128 support):

Line 157:

```diff
- numba==0.62.1 bitsandbytes==0.48.1 || \
+ numba==0.62.1 bitsandbytes==0.48.2 || \
```

Line 183:

```diff
- bitsandbytes==0.48.1 \
+ bitsandbytes==0.48.2 \
```
🧹 Nitpick comments (1)
images/universal/training/py312-cuda128-torch280/Dockerfile (1)
232-235: Clarify the 2-step `causal-conv1d` + `mamba-ssm --no-deps` pattern. Lines 232–235 use a 2-step installation with `--no-deps` on `mamba-ssm`, which bypasses pip's dependency verification. While this pattern can improve build determinism, document why this explicit approach is necessary (e.g., "causal-conv1d is optional for mamba-ssm and required for performance; using `--no-deps` avoids resolver conflicts with training-hub"). Alternatively, declare `causal-conv1d==1.5.3.post1` in project requirements so pip resolves both transitive versions with proper constraints.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (2)
- `images/runtime/.DS_Store` is excluded by `!**/.DS_Store`
- `images/runtime/training/.DS_Store` is excluded by `!**/.DS_Store`
📒 Files selected for processing (5)
- `.tekton/universal-image-py312-cuda128-torch280-pull-request.yaml` (1 hunks)
- `.tekton/universal-image-py312-cuda128-torch280-push.yaml` (1 hunks)
- `images/universal/training/py312-cuda128-torch280/Dockerfile` (5 hunks)
- `images/universal/training/py312-cuda128-torch280/Dockerfile.debug` (1 hunks)
- `images/universal/training/py312-cuda128-torch280/README.md` (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- .tekton/universal-image-py312-cuda128-torch280-push.yaml
- images/universal/training/py312-cuda128-torch280/README.md
🧰 Additional context used
🪛 Hadolint (2.14.0)
images/universal/training/py312-cuda128-torch280/Dockerfile
[error] 75-75: Do not refer to an environment variable within the same ENV statement where it is defined.
(DL3044)
[error] 163-163: Multiple redirections compete for stdout. Use cat, tee, or pass filenames instead.
(SC2261)
images/universal/training/py312-cuda128-torch280/Dockerfile.debug
[error] 75-75: Do not refer to an environment variable within the same ENV statement where it is defined.
(DL3044)
[error] 163-163: Multiple redirections compete for stdout. Use cat, tee, or pass filenames instead.
(SC2261)
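For the SC2261 findings above: the warning means two redirections target stdout, so only the last one takes effect. A minimal shell sketch of the `tee` fix (file paths are illustrative):

```shell
# Anti-pattern (SC2261): `cmd > a.log > b.log` -- only b.log receives output.
# Fix: tee duplicates stdout into one file while `>` captures the other.
printf 'build ok\n' | tee /tmp/sc2261-a.log > /tmp/sc2261-b.log
# Both files now contain the same line.
cat /tmp/sc2261-a.log
cat /tmp/sc2261-b.log
```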
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Red Hat Konflux / universal-image-py312-cuda128-torch280-on-pull-request
🔇 Additional comments (3)
.tekton/universal-image-py312-cuda128-torch280-pull-request.yaml (1)
21-22: Pipeline timeout increase is appropriate for extended build workflow.

The 9-hour timeout aligns with the expanded build steps (CUDA diagnostics, two-phase wheel compilation, extensive package installation) introduced in the updated Dockerfile. No other changes to pipeline structure or task definitions.
images/universal/training/py312-cuda128-torch280/Dockerfile.debug (1)
22-22: No issues found: required source files are present.

Verification confirms both `LICENSE.md` and `entrypoint-universal.sh` exist at the expected paths. The Dockerfile COPY commands at lines 22 and 319 will succeed.

`images/universal/training/py312-cuda128-torch280/Dockerfile` (1)
22-22: Required source files verified to exist in repository.

Both `LICENSE.md` and `entrypoint-universal.sh` are committed and present at their expected locations. The Dockerfile COPY commands at lines 22 and 238 will succeed without issues.
```diff
 ENV CUDA_HOME=/usr/local/cuda \
-    PATH=/usr/local/cuda/bin:$PATH \
-    LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH \
-    TORCH_CUDA_ARCH_LIST="8.0;8.6;8.9;9.0"
+    PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH} \
+    LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:$CUDA_HOME/lib64:$CUDA_HOME/extras/CUPTI/lib64:$LD_LIBRARY_PATH \
+    TORCH_CUDA_ARCH_LIST="${CUDA_ARCH_LIST}"
```
Resolve Hadolint DL3044: ENV statement self-references variable being defined.

Lines 75–78 violate Docker best practice by referencing `${CUDA_HOME}` within the same `ENV` statement where it is being defined. Additionally, `${LD_LIBRARY_PATH}` on line 77 references itself within the same statement. This was flagged in prior reviews but remains unresolved.

Refactor into separate `ENV` statements:

```diff
- ARG CUDA_ARCH_LIST=9.0
- ENV CUDA_HOME=/usr/local/cuda \
-     PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH} \
-     LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:$CUDA_HOME/lib64:$CUDA_HOME/extras/CUPTI/lib64:$LD_LIBRARY_PATH \
-     TORCH_CUDA_ARCH_LIST="${CUDA_ARCH_LIST}"
+ ARG CUDA_ARCH_LIST=9.0
+ ENV CUDA_HOME=/usr/local/cuda \
+     TORCH_CUDA_ARCH_LIST="${CUDA_ARCH_LIST}"
+ ENV PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH}
+ ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:${CUDA_HOME}/lib64:${CUDA_HOME}/extras/CUPTI/lib64:${LD_LIBRARY_PATH}
```

🧰 Tools
🪛 Hadolint (2.14.0)
[error] 75-75: Do not refer to an environment variable within the same ENV statement where it is defined.
(DL3044)
🤖 Prompt for AI Agents
In images/universal/training/py312-cuda128-torch280/Dockerfile around lines 75
to 78, the single ENV statement self-references variables being defined
(CUDA_HOME and LD_LIBRARY_PATH) which triggers Hadolint DL3044; split into
multiple ENV statements: first set CUDA_HOME alone, then set PATH referencing
the already-defined CUDA_HOME, then set LD_LIBRARY_PATH referencing CUDA_HOME
and the prior LD_LIBRARY_PATH, and finally set TORCH_CUDA_ARCH_LIST in its own
ENV line so no variable is referenced in the same ENV where it is defined.
| && echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf \ | ||
| && echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf \ | ||
| && dnf clean all \ | ||
| && rm -rf /var/cache/dnf/* |
Add `ldconfig` call to rebuild dynamic linker cache after updating `/etc/ld.so.conf.d/`.

Lines 68–71 append NVIDIA library paths to `/etc/ld.so.conf.d/nvidia.conf`, but the dynamic linker cache is not rebuilt. Without running `ldconfig`, the CUDA libraries will not be discoverable at line 88 (the `ldconfig` grep check) or at runtime.

Apply this diff:

```diff
   && echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf \
   && echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf \
+  && ldconfig \
   && dnf clean all \
```

📝 Committable suggestion
‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```dockerfile
    && echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf \
    && echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf \
    && ldconfig \
    && dnf clean all \
    && rm -rf /var/cache/dnf/*
```
🤖 Prompt for AI Agents
In images/universal/training/py312-cuda128-torch280/Dockerfile.debug around
lines 68 to 71, after appending NVIDIA library paths to
/etc/ld.so.conf.d/nvidia.conf you must run ldconfig to rebuild the dynamic
linker cache so the CUDA libraries are discoverable later; insert a step that
runs ldconfig (as root) immediately after the echo lines and before the dnf
clean all and rm -rf /var/cache/dnf/* lines.
```dockerfile
ENV CUDA_HOME=/usr/local/cuda \
    PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH} \
    LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:$CUDA_HOME/lib64:$CUDA_HOME/extras/CUPTI/lib64:$LD_LIBRARY_PATH \
    TORCH_CUDA_ARCH_LIST="${CUDA_ARCH_LIST}"
```
Resolve Hadolint DL3044: ENV statement self-references variable being defined.

Lines 75–78 violate Docker best practice by referencing `${CUDA_HOME}` within the same `ENV` statement where it is being defined. Additionally, `${LD_LIBRARY_PATH}` on line 77 references itself within the same statement. This prevents Docker from properly expanding these values and is flagged by Hadolint DL3044.

Refactor into separate `ENV` statements:

```diff
- ARG CUDA_ARCH_LIST=9.0
- ENV CUDA_HOME=/usr/local/cuda \
-     PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH} \
-     LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:$CUDA_HOME/lib64:$CUDA_HOME/extras/CUPTI/lib64:$LD_LIBRARY_PATH \
-     TORCH_CUDA_ARCH_LIST="${CUDA_ARCH_LIST}"
+ ARG CUDA_ARCH_LIST=9.0
+ ENV CUDA_HOME=/usr/local/cuda \
+     TORCH_CUDA_ARCH_LIST="${CUDA_ARCH_LIST}"
+ ENV PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH}
+ ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:${CUDA_HOME}/lib64:${CUDA_HOME}/extras/CUPTI/lib64:${LD_LIBRARY_PATH}
```
+ ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:${CUDA_HOME}/lib64:${CUDA_HOME}/extras/CUPTI/lib64:${LD_LIBRARY_PATH}Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Hadolint (2.14.0)
[error] 75-75: Do not refer to an environment variable within the same ENV statement where it is defined.
(DL3044)
🤖 Prompt for AI Agents
In images/universal/training/py312-cuda128-torch280/Dockerfile.debug around
lines 75 to 78, the ENV statement self-references variables being set (CUDA_HOME
and LD_LIBRARY_PATH), which prevents proper expansion and triggers Hadolint
DL3044; split the combined ENV into multiple ENV statements so CUDA_HOME is
defined first, then set PATH and LD_LIBRARY_PATH in subsequent ENV lines that
reference the already-defined CUDA_HOME, and set TORCH_CUDA_ARCH_LIST in its own
ENV (or after CUDA_HOME) to ensure variables are expanded correctly.
/retest |
Force-pushed ba6e504 to fa75edd (Compare)
Actionable comments posted: 0
♻️ Duplicate comments (5)
images/universal/training/py312-cuda128-torch280/Dockerfile (3)
68-71: Add `ldconfig` call after updating `/etc/ld.so.conf.d/nvidia.conf`.

Lines 68–69 append NVIDIA library paths to the ldconfig config, but the dynamic linker cache is not rebuilt. Without `ldconfig`, the CUDA libraries will not be discoverable at line 88 (the `ldconfig` grep check) or at runtime. This was flagged in prior reviews but remains unresolved.

Apply this diff:

```diff
   && echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf \
   && echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf \
+  && ldconfig \
   && dnf clean all \
```
75-78: Hadolint DL3044 violation: ENV statement self-references variables being defined.

Lines 75–78 violate Docker best practices by referencing `${CUDA_HOME}` and `${LD_LIBRARY_PATH}` within the same `ENV` statement where they are being defined. This was flagged in prior reviews but remains unresolved.

Split into separate `ENV` statements:

```diff
+ ARG CUDA_ARCH_LIST=9.0
+ ENV CUDA_HOME=/usr/local/cuda \
+     TORCH_CUDA_ARCH_LIST="${CUDA_ARCH_LIST}"
+ ENV PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH}
+ ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:${CUDA_HOME}/lib64:${CUDA_HOME}/extras/CUPTI/lib64:${LD_LIBRARY_PATH}
- ARG CUDA_ARCH_LIST=9.0
- ENV CUDA_HOME=/usr/local/cuda \
-     PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH} \
-     LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:$CUDA_HOME/lib64:$CUDA_HOME/extras/CUPTI/lib64:$LD_LIBRARY_PATH \
-     TORCH_CUDA_ARCH_LIST="${CUDA_ARCH_LIST}"
```
157-157: Update bitsandbytes from 0.48.1 to 0.48.2 at lines 157 and 183.

bitsandbytes 0.48.1 has a confirmed packaging/build problem with CUDA 12.8 wheels, and 0.48.2 contains the fix. The Dockerfile must apply the suggested diffs to both lines:

Line 157:

```diff
-    numba==0.62.1 bitsandbytes==0.48.1 || \
+    numba==0.62.1 bitsandbytes==0.48.2 || \
```

Line 183:

```diff
-    bitsandbytes==0.48.1 \
+    bitsandbytes==0.48.2 \
```

`images/universal/training/py312-cuda128-torch280/Dockerfile.debug` (2)
68-71: Add `ldconfig` call after updating `/etc/ld.so.conf.d/nvidia.conf`.

Lines 68–69 append NVIDIA library paths to the ldconfig config, but the dynamic linker cache is not rebuilt. Without `ldconfig`, the CUDA libraries will not be discoverable at line 88 (the `ldconfig` grep check) or at runtime. This was flagged in prior reviews but remains unresolved.

Apply this diff:

```diff
   && echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf \
   && echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf \
+  && ldconfig \
   && dnf clean all \
```
75-78: Hadolint DL3044 violation: ENV statement self-references variables being defined.

Lines 75–78 violate Docker best practices by referencing `${CUDA_HOME}` and `${LD_LIBRARY_PATH}` within the same `ENV` statement where they are being defined. This was flagged in prior reviews but remains unresolved.

Split into separate `ENV` statements:

```diff
+ ARG CUDA_ARCH_LIST=9.0
+ ENV CUDA_HOME=/usr/local/cuda \
+     TORCH_CUDA_ARCH_LIST="${CUDA_ARCH_LIST}"
+ ENV PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH}
+ ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:${CUDA_HOME}/lib64:${CUDA_HOME}/extras/CUPTI/lib64:${LD_LIBRARY_PATH}
- ARG CUDA_ARCH_LIST=9.0
- ENV CUDA_HOME=/usr/local/cuda \
-     PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH} \
-     LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:$CUDA_HOME/lib64:$CUDA_HOME/extras/CUPTI/lib64:$LD_LIBRARY_PATH \
-     TORCH_CUDA_ARCH_LIST="${CUDA_ARCH_LIST}"
```
🧹 Nitpick comments (2)
images/universal/training/py312-cuda128-torch280/Dockerfile (1)
232-235: Verify the mamba-ssm install pattern and document the rationale.

The two-step install (`causal-conv1d` then `mamba-ssm --no-deps`) assumes causal-conv1d satisfies all of mamba-ssm's dependencies. While this pattern can improve determinism, using `--no-deps` bypasses pip's dependency verification, which carries risk if mamba-ssm has undeclared dependencies or if version incompatibilities arise.

Consider adding a comment explaining why this explicit pattern is necessary (e.g., "avoids resolver conflicts with training-hub") or refactor to a single install with both packages declared.
images/universal/training/py312-cuda128-torch280/Dockerfile.debug (1)
232-316: Complex mamba-ssm build wrapper in debug variant: verify necessity and maintainability.

This file's lines 232–316 introduce an 84-line shell script that wraps the mamba-ssm wheel build with heartbeat logging, timeout enforcement (20 min), and process monitoring. This is significantly more elaborate than the simple 3-line install pattern in the main Dockerfile (lines 232–235).
Questions:
- Is the heartbeat/timeout wrapper necessary only for the debug variant, or should it also be in the production Dockerfile?
- Does the script properly handle edge cases (e.g., timeouts during ccache setup)?
- Is the 20-minute hardcoded timeout appropriate for all platforms/architectures?
Recommendation: Add a comment explaining the rationale (e.g., "Debug builds may run longer due to extended diagnostics; timeout prevents hung builds") and consider whether similar protections are needed in the production build.
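The heartbeat-plus-timeout idea the debug variant implements can be reduced to a small portable sketch; the function name, the 1-second poll interval, and the exit code 124 are illustrative choices, not the PR's actual script:

```shell
# Run a command in the background, print a heartbeat while it works,
# and kill it if it exceeds the timeout (mirroring `timeout`'s exit code 124).
run_with_heartbeat() {
  timeout_s=$1; shift
  "$@" &
  pid=$!
  waited=0
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$timeout_s" ]; then
      kill "$pid" 2>/dev/null
      echo "TIMEOUT after ${timeout_s}s"
      return 124
    fi
    echo "heartbeat: ${waited}s elapsed, build still running"
    sleep 1
    waited=$((waited + 1))
  done
  wait "$pid"
}

# A fast stand-in for the 20-minute wheel build:
run_with_heartbeat 5 sleep 1 && echo "build finished"
```

In a Dockerfile this would wrap the actual `pip wheel` invocation instead of `sleep`, with the timeout set to the build budget (20 minutes in the debug variant).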
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (2)
- `images/runtime/.DS_Store` is excluded by `!**/.DS_Store`
- `images/runtime/training/.DS_Store` is excluded by `!**/.DS_Store`
📒 Files selected for processing (5)
- `.tekton/universal-image-py312-cuda128-torch280-pull-request.yaml` (1 hunks)
- `.tekton/universal-image-py312-cuda128-torch280-push.yaml` (1 hunks)
- `images/universal/training/py312-cuda128-torch280/Dockerfile` (5 hunks)
- `images/universal/training/py312-cuda128-torch280/Dockerfile.debug` (1 hunks)
- `images/universal/training/py312-cuda128-torch280/README.md` (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- .tekton/universal-image-py312-cuda128-torch280-push.yaml
- .tekton/universal-image-py312-cuda128-torch280-pull-request.yaml
- images/universal/training/py312-cuda128-torch280/README.md
🧰 Additional context used
🪛 Hadolint (2.14.0)
images/universal/training/py312-cuda128-torch280/Dockerfile.debug
[error] 75-75: Do not refer to an environment variable within the same ENV statement where it is defined.
(DL3044)
[error] 163-163: Multiple redirections compete for stdout. Use cat, tee, or pass filenames instead.
(SC2261)
images/universal/training/py312-cuda128-torch280/Dockerfile
[error] 75-75: Do not refer to an environment variable within the same ENV statement where it is defined.
(DL3044)
[error] 163-163: Multiple redirections compete for stdout. Use cat, tee, or pass filenames instead.
(SC2261)
🔇 Additional comments (3)
images/universal/training/py312-cuda128-torch280/Dockerfile (1)
238-238: Verify that `entrypoint-universal.sh` exists in the repository.

Line 238 copies `./images/universal/training/py312-cuda128-torch280/entrypoint-universal.sh` into the image. This file must exist in the repository for the build to succeed.

`images/universal/training/py312-cuda128-torch280/Dockerfile.debug` (2)
319-319: Verify that `entrypoint-universal.sh` exists in the repository.

Line 319 copies `./images/universal/training/py312-cuda128-torch280/entrypoint-universal.sh` into the image. This file must exist in the repository for the build to succeed.
157-157: Review comment is incorrect: bitsandbytes 0.48.1 does provide CUDA 12.8 wheels.

Web search confirms that bitsandbytes 0.48.1 (released Oct 2, 2025) includes binaries targeted for CUDA 12.8/12.9 on PyPI. The review's claim that version 0.48.1 "lacks CUDA 12.8 wheels" is factually incorrect.

Additionally, upgrading to the recommended 0.48.2 would be counterproductive: bitsandbytes 0.48.2 (Oct 29, 2025) targets CUDA 11.7–12.6, which does not include CUDA 12.8 support. For a Dockerfile explicitly targeting CUDA 12.8 (as indicated by the filename `py312-cuda128-torch280`), the current pinned version 0.48.1 is correct.

No changes are needed. The Dockerfile lines 157 and 183 should remain as-is.
Likely an incorrect or invalid review comment.
Force-pushed fa75edd to d4015cd (Compare)
Actionable comments posted: 2
♻️ Duplicate comments (4)
images/universal/training/py312-cuda128-torch280/Dockerfile.debug (2)
68-71: ⚠️ CRITICAL: Missing `ldconfig` call after updating ld.so.conf.d.

Lines 68–71 append NVIDIA library paths to `/etc/ld.so.conf.d/nvidia.conf`, but do not rebuild the dynamic linker cache. Without running `ldconfig`, the CUDA libraries will not be discoverable at the preflight check (line 88) or at runtime. The fail-fast check will fail immediately.

Apply this diff to add ldconfig:

```diff
   && echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf \
   && echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf \
+  && ldconfig \
   && dnf clean all \
```
75-78: ⚠️ CRITICAL: Resolve Hadolint DL3044 — ENV self-references variables being defined.

Lines 75–78 violate Docker best practice by referencing `${CUDA_HOME}` within the same `ENV` statement where it is being defined. Additionally, `${LD_LIBRARY_PATH}` on line 77 references itself within the same statement. This prevents Docker from properly expanding these values and is flagged by Hadolint DL3044.

Refactor into separate `ENV` statements:

```diff
- ARG CUDA_ARCH_LIST=9.0
- ENV CUDA_HOME=/usr/local/cuda \
-     PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH} \
-     LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:$CUDA_HOME/lib64:$CUDA_HOME/extras/CUPTI/lib64:$LD_LIBRARY_PATH \
-     TORCH_CUDA_ARCH_LIST="${CUDA_ARCH_LIST}"
+ ARG CUDA_ARCH_LIST=9.0
+ ENV CUDA_HOME=/usr/local/cuda \
+     TORCH_CUDA_ARCH_LIST="${CUDA_ARCH_LIST}"
+ ENV PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH}
+ ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:${CUDA_HOME}/lib64:${CUDA_HOME}/extras/CUPTI/lib64:${LD_LIBRARY_PATH}
```
68-71: ⚠️ CRITICAL: Missing `ldconfig` call after updating ld.so.conf.d.

Lines 68–71 append NVIDIA library paths to `/etc/ld.so.conf.d/nvidia.conf`, but do not rebuild the dynamic linker cache. Without running `ldconfig`, the CUDA libraries will not be discoverable at the preflight check (line 88) or at runtime. The fail-fast check will fail immediately.

Note: This issue was flagged in prior reviews as addressed (commit a8ad8e6) but remains unresolved in the current code.

Apply this diff to add ldconfig:

```diff
   && echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf \
   && echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf \
+  && ldconfig \
   && dnf clean all \
```
75-78: ⚠️ CRITICAL: Resolve Hadolint DL3044 — ENV self-references variables being defined.

Lines 75–78 violate Docker best practice by referencing `${CUDA_HOME}` within the same `ENV` statement where it is being defined. Additionally, `${LD_LIBRARY_PATH}` on line 77 references itself within the same statement. This prevents Docker from properly expanding these values and is flagged by Hadolint DL3044.

Note: This issue was flagged in prior reviews as addressed (commit a8ad8e6) but remains unresolved in the current code.

Refactor into separate `ENV` statements:

```diff
- ARG CUDA_ARCH_LIST=9.0
- ENV CUDA_HOME=/usr/local/cuda \
-     PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH} \
-     LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:$CUDA_HOME/lib64:$CUDA_HOME/extras/CUPTI/lib64:$LD_LIBRARY_PATH \
-     TORCH_CUDA_ARCH_LIST="${CUDA_ARCH_LIST}"
+ ARG CUDA_ARCH_LIST=9.0
+ ENV CUDA_HOME=/usr/local/cuda \
+     TORCH_CUDA_ARCH_LIST="${CUDA_ARCH_LIST}"
+ ENV PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH}
+ ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:${CUDA_HOME}/lib64:${CUDA_HOME}/extras/CUPTI/lib64:${LD_LIBRARY_PATH}
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (2)
- `images/runtime/.DS_Store` is excluded by `!**/.DS_Store`
- `images/runtime/training/.DS_Store` is excluded by `!**/.DS_Store`
📒 Files selected for processing (5)
- `.tekton/universal-image-py312-cuda128-torch280-pull-request.yaml` (4 hunks)
- `.tekton/universal-image-py312-cuda128-torch280-push.yaml` (4 hunks)
- `images/universal/training/py312-cuda128-torch280/Dockerfile` (5 hunks)
- `images/universal/training/py312-cuda128-torch280/Dockerfile.debug` (1 hunks)
- `images/universal/training/py312-cuda128-torch280/README.md` (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- images/universal/training/py312-cuda128-torch280/README.md
🚧 Files skipped from review as they are similar to previous changes (1)
- .tekton/universal-image-py312-cuda128-torch280-pull-request.yaml
🧰 Additional context used
🪛 Hadolint (2.14.0)
images/universal/training/py312-cuda128-torch280/Dockerfile
[error] 75-75: Do not refer to an environment variable within the same ENV statement where it is defined.
(DL3044)
[error] 163-163: Multiple redirections compete for stdout. Use cat, tee, or pass filenames instead.
(SC2261)
images/universal/training/py312-cuda128-torch280/Dockerfile.debug
[error] 75-75: Do not refer to an environment variable within the same ENV statement where it is defined.
(DL3044)
[error] 163-163: Multiple redirections compete for stdout. Use cat, tee, or pass filenames instead.
(SC2261)
🔇 Additional comments (4)
.tekton/universal-image-py312-cuda128-torch280-push.yaml (1)
20-21: ✓ Tekton pipeline changes look good.

Timeout, build-platforms parameter declaration, and PLATFORM task parameter changes support multi-platform build orchestration consistently. The hardcoded amd64 default aligns with the Dockerfile's amd64 requirement for FlashAttention (visible in the companion Dockerfile at line 121).
Also applies to: 31-33, 108-111, 199-200
images/universal/training/py312-cuda128-torch280/Dockerfile (2)
232-239: ✓ Simplified 2-step installation approach is solid.

The 2-step pattern for causal-conv1d and mamba-ssm (lines 232–239) is cleaner than the Dockerfile.debug equivalent: it installs both dependencies in a straightforward manner without unnecessary `--no-deps` flags or dependency-resolution bypass. This improves auditability and allows pip to verify compatibility.

22-22: All required COPY source files exist; no issues found.

Both `LICENSE.md` and `entrypoint-universal.sh` are present in the repository at their specified paths. The COPY commands on lines 22 and 242 will execute successfully.

`images/universal/training/py312-cuda128-torch280/Dockerfile.debug` (1)
22-22: No issues found: required COPY source files are present.

Both `LICENSE.md` and `entrypoint-universal.sh` exist at their specified paths in the repository. The Docker build will not fail due to missing COPY source files.
```dockerfile
# Fail-fast: ensure binary wheels exist for packages that are expensive to build
RUN mkdir -p /tmp/wheels && \
    python -m pip download --retries 5 --timeout 120 --only-binary=:all: --no-deps -d /tmp/wheels \
    numba==0.62.1 bitsandbytes==0.48.1 || \
```
Lines 157 and 183 pin `bitsandbytes==0.48.1`, which does not provide CUDA 12.8 (cu128) wheel variants on PyPI. The fail-fast check at lines 155–158 will fail, and at runtime the library will fail to load `libbitsandbytes_cuda128`. Web search confirms that bitsandbytes 0.48.2+ provides cu128 support.

Note: This issue was flagged in prior reviews as addressed (commits 02c9a56, ee7d064) but remains unresolved in the current code.

Upgrade to `bitsandbytes==0.48.2` or later in both locations:

```diff
- RUN mkdir -p /tmp/wheels && \
-     python -m pip download --retries 5 --timeout 120 --only-binary=:all: --no-deps -d /tmp/wheels \
-     numba==0.62.1 bitsandbytes==0.48.1 || \
+ RUN mkdir -p /tmp/wheels && \
+     python -m pip download --retries 5 --timeout 120 --only-binary=:all: --no-deps -d /tmp/wheels \
+     numba==0.62.1 bitsandbytes==0.48.2 || \
```

And:

```diff
      tensorboard==2.19.0 \
-     bitsandbytes==0.48.1 \
+     bitsandbytes==0.48.2 \
      liger-kernel==0.6.2 \
```
liger-kernel==0.6.2 \Also applies to: 183-183
🤖 Prompt for AI Agents
images/universal/training/py312-cuda128-torch280/Dockerfile around lines 157 and
183: bitsandbytes is pinned to 0.48.1 which lacks CUDA 12.8 wheels; update both
occurrences to bitsandbytes==0.48.2 (or a later compatible patch) so the
Dockerfile and any fail-fast checks reference a release that provides cu128
wheels, and ensure both line 157 and line 183 are changed to the new version.
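The fail-fast pattern under discussion can also surface a clearer error when wheels are missing; a sketch using the review's suggested pins (the error message text is illustrative):

```dockerfile
# Abort the build early if binary wheels are unavailable for the pinned
# versions, instead of silently falling back to a multi-hour source build.
RUN mkdir -p /tmp/wheels && \
    ( python -m pip download --retries 5 --timeout 120 \
        --only-binary=:all: --no-deps -d /tmp/wheels \
        numba==0.62.1 bitsandbytes==0.48.2 \
      || { echo "ERROR: no binary wheels for pinned versions" >&2; exit 1; } )
```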
Actionable comments posted: 0
♻️ Duplicate comments (3)
images/universal/training/py312-cuda128-torch280/Dockerfile (3)
69-72: Split ENV statement to avoid self-referencing variables (Hadolint DL3044).

This issue was flagged in prior reviews but remains unresolved. The single `ENV` statement self-references `${PATH}` and `${LD_LIBRARY_PATH}`, which violates Dockerfile best practices. Split into separate statements:

```diff
- ENV CUDA_HOME=/usr/local/cuda \
-     PATH=/usr/local/cuda/bin:$PATH \
-     LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH \
-     TORCH_CUDA_ARCH_LIST="8.0;8.6;8.9;9.0"
+ ENV CUDA_HOME=/usr/local/cuda
+ ENV PATH=/usr/local/cuda/bin:${PATH}
+ ENV LD_LIBRARY_PATH=/usr/local/cuda/lib64:${CUDA_HOME}/lib64:${LD_LIBRARY_PATH}
+ ENV TORCH_CUDA_ARCH_LIST="8.0;8.6;8.9;9.0"
```
140-140: Upgrade bitsandbytes to 0.48.2 or later (critical CUDA 12.8 support).

bitsandbytes 0.48.1 lacks official CUDA 12.8 wheels on PyPI and will fail at runtime. This critical issue was flagged in multiple prior reviews and commits. Upgrade to 0.48.2 or later, which provides confirmed cu128 support.

```diff
-     bitsandbytes==0.48.1 \
+     bitsandbytes==0.48.2 \
```

Also update line 155 (pip download step) if present.
167-170: Remove `--no-deps` flag and install both packages in a single command.

The `--no-deps` flag on line 169 bypasses pip's dependency resolution, preventing verification that causal-conv1d and mamba-ssm are compatible. This pattern was flagged in prior reviews. Install both together without `--no-deps` to enable proper validation:

```diff
- # Deterministic 2-step: sub-dep first, then parent without deps
- RUN pip install --no-build-isolation --no-cache-dir causal-conv1d==1.5.3.post1 && \
-     pip install --no-build-isolation --no-cache-dir mamba-ssm==2.2.6.post3 --no-deps && \
+ # Install mamba-ssm with its optional causal-conv1d dependency
+ RUN pip install --no-build-isolation --no-cache-dir causal-conv1d==1.5.3.post1 mamba-ssm==2.2.6.post3 && \
      fix-permissions /opt/app-root -P
```
🧹 Nitpick comments (2)
images/universal/training/py312-cuda128-torch280/Dockerfile (2)
22-23: Resolve the LICENSE.md TODO before release.

The commented-out COPY instruction should either be re-enabled with the file added to the repo, or removed entirely if licensing is handled elsewhere. Clarify the intent and complete the task.
Do you want me to help you either generate a suitable LICENSE.md file or remove this TODO entirely?
161-166: Remove duplicate TORCH_CUDA_ARCH_LIST and pin versions for build helpers.

Line 165 duplicates the TORCH_CUDA_ARCH_LIST already defined on line 72. Additionally, line 162 uses `pip install -U`, which unpins versions and can cause drift; provide explicit versions for reproducibility.

```diff
  # Build helpers to compile PyTorch extensions
- RUN pip install -U pip setuptools wheel ninja cmake
+ RUN pip install --no-cache-dir pip setuptools wheel ninja cmake
- # Optional: set GPU archs if you hit arch issues
- ENV TORCH_CUDA_ARCH_LIST="8.0;8.6;8.9;9.0"
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- `images/universal/training/py312-cuda128-torch280/Dockerfile` (3 hunks)
🧰 Additional context used
🪛 Hadolint (2.14.0)
images/universal/training/py312-cuda128-torch280/Dockerfile
[error] 120-120: Multiple redirections compete for stdout. Use cat, tee, or pass filenames instead.
(SC2261)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Red Hat Konflux / universal-image-py312-cuda128-torch280-on-pull-request
🔇 Additional comments (2)
images/universal/training/py312-cuda128-torch280/Dockerfile (2)
112-112: Clarify the SDK branch reference.

Line 112 references the `add-training-hub` branch instead of a release branch. Confirm whether this is intentional for development/testing or if it should reference a stable branch (e.g., `main` or a version tag). The past review comments indicated this was temporary.
173-178: File exists and permissions are correctly configured.

The verification confirms that `entrypoint-universal.sh` exists in the repository at the expected location and will receive the correct executable permissions (0755) via the `--chmod` flag during the Docker build. The build will not fail due to a missing file.
Signed-off-by: Brian Gallagher <[email protected]>
Force-pushed 5598440 to 27d20b4 (Compare)
Actionable comments posted: 1
♻️ Duplicate comments (1)
images/universal/training/py312-cuda128-torch280/Dockerfile.debug (1)
75-78: Resolve Hadolint DL3044: Split self-referencing ENV statement.

Lines 75–78 violate Docker best practices (Hadolint DL3044) by referencing `${CUDA_HOME}`, `${PATH}`, and `${LD_LIBRARY_PATH}` within the same `ENV` statement where they are being defined. This prevents proper variable expansion and is unreliable across Docker versions.

Refactor into separate `ENV` statements:

```diff
- ARG CUDA_ARCH_LIST=9.0
- ENV CUDA_HOME=/usr/local/cuda \
-     PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH} \
-     LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:$CUDA_HOME/lib64:$CUDA_HOME/extras/CUPTI/lib64:$LD_LIBRARY_PATH \
-     TORCH_CUDA_ARCH_LIST="${CUDA_ARCH_LIST}"
+ ARG CUDA_ARCH_LIST=9.0
+ ENV CUDA_HOME=/usr/local/cuda \
+     TORCH_CUDA_ARCH_LIST="${CUDA_ARCH_LIST}"
+ ENV PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH}
+ ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:${CUDA_HOME}/lib64:${CUDA_HOME}/extras/CUPTI/lib64:${LD_LIBRARY_PATH}
```
🧹 Nitpick comments (1)
images/universal/training/py312-cuda128-torch280/Dockerfile.debug (1)
234-316: Consider extracting complex wheel-build logic into a separate shell script for maintainability.

Lines 234–316 contain a highly complex multi-line `RUN` statement that uses `printf` to generate an inline shell script with ccache logic, timeouts, heartbeat monitoring, and error handling. While the defensive timeout logic and diagnostics are valuable, embedding this complexity makes the Dockerfile harder to audit and maintain.

Suggested improvement (for a follow-up refactor): Extract the script generation and execution into a separate build script file (e.g., `build-mamba-wheel.sh`), then invoke it from the Dockerfile. This would:
- Separate build logic from container configuration
- Improve readability and testability
- Enable syntax highlighting and linting on the script separately
- Simplify the Dockerfile layer
For now, the current approach is functional. No blocking issue, but worth considering for future maintenance.
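The heartbeat-plus-timeout pattern described above can be sketched as a small reusable function; this is a hedged illustration with toy intervals, not the Dockerfile's actual `printf`-generated script. It runs a long command under GNU `timeout` while a background loop emits periodic progress lines so CI log watchdogs do not kill a "silent" build:

```shell
# Run "$@" under a wall-clock limit, emitting heartbeat lines meanwhile.
# $1 = heartbeat interval (s), $2 = timeout limit (s), rest = command.
run_with_heartbeat() {
  interval="$1"; limit="$2"; shift 2
  ( while :; do echo "heartbeat: still building..."; sleep "$interval"; done ) &
  hb=$!
  timeout --kill-after=10 "$limit" "$@"
  rc=$?
  kill "$hb" 2>/dev/null
  wait "$hb" 2>/dev/null
  return "$rc"
}

# Toy invocation: a 1-second "build" under a generous limit succeeds.
run_with_heartbeat 5 30 sleep 1 && echo "build-ok"
```

Extracted into `build-mamba-wheel.sh`, a helper like this can be linted with shellcheck and unit-tested outside the image build.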
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (2)
- `images/runtime/.DS_Store` is excluded by `!**/.DS_Store`
- `images/runtime/training/.DS_Store` is excluded by `!**/.DS_Store`
📒 Files selected for processing (5)
- `.tekton/universal-image-py312-cuda128-torch280-pull-request.yaml` (1 hunks)
- `.tekton/universal-image-py312-cuda128-torch280-push.yaml` (1 hunks)
- `images/universal/training/py312-cuda128-torch280/Dockerfile` (3 hunks)
- `images/universal/training/py312-cuda128-torch280/Dockerfile.debug` (1 hunks)
- `images/universal/training/py312-cuda128-torch280/README.md` (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- .tekton/universal-image-py312-cuda128-torch280-push.yaml
- images/universal/training/py312-cuda128-torch280/README.md
🧰 Additional context used
🪛 Hadolint (2.14.0)
images/universal/training/py312-cuda128-torch280/Dockerfile
[error] 120-120: Multiple redirections compete for stdout. Use cat, tee, or pass filenames instead.
(SC2261)
images/universal/training/py312-cuda128-torch280/Dockerfile.debug
[error] 75-75: Do not refer to an environment variable within the same ENV statement where it is defined.
(DL3044)
[error] 163-163: Multiple redirections compete for stdout. Use cat, tee, or pass filenames instead.
(SC2261)
🔇 Additional comments (6)
.tekton/universal-image-py312-cuda128-torch280-pull-request.yaml (1)
21-22: Add pipeline timeout for long-running builds.

The 9-hour timeout is appropriate for CUDA toolchain compilation and FlashAttention builds. This aligns with the PR objective and matches similar timeout settings across related Tekton pipelines.
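For reference, a pipeline-level ceiling of this shape lives under `spec.timeouts` in a Tekton v1 PipelineRun; a minimal hedged sketch (the metadata name and task split are illustrative — only the 9h pipeline value comes from this PR):

```yaml
apiVersion: tekton.dev/v1
kind: PipelineRun
metadata:
  name: universal-image-build   # illustrative name
spec:
  timeouts:
    pipeline: "9h"      # overall ceiling for the whole run
    tasks: "8h30m"      # optional: leave headroom for finally tasks
```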
images/universal/training/py312-cuda128-torch280/Dockerfile (4)
112-112: Verify the SDK git branch is finalized before merge.

Line 112 references a feature branch: `git+https://github.com/briangallagher/sdk@add-training-hub`. Ensure this branch name reflects the final intended state; temporary or feature-specific branches should be resolved to a stable tag or the main branch before release.

Also applies to: 112-112
173-173: Entrypoint wrapper wiring is correct and aligns with the debug variant.

The addition of a POSIX entrypoint wrapper at line 173 enables both workbench and headless runtime modes. This mirrors the approach in `Dockerfile.debug` and is well-motivated by the PR objective to support Training Hub 0.3.0.

Also applies to: 173-173
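The dual-mode behavior described above can be sketched as a tiny POSIX shell dispatcher; this is a hypothetical illustration, not the repository's actual `entrypoint-universal.sh`. The echoed strings stand in for launching the notebook server and for `exec "$@"`:

```shell
# With no arguments, fall through to the workbench default;
# with arguments, exec them directly for headless training jobs.
entrypoint() {
  if [ "$#" -eq 0 ]; then
    echo "start-workbench"      # stand-in for the workbench launcher
  else
    echo "exec: $*"             # stand-in for `exec "$@"`
  fi
}

entrypoint                       # container run with no command
entrypoint python train.py       # container run with a training command
```

This shape preserves the image's existing default CMD while letting Kubernetes Jobs override it with a command.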
140-140: The review comment is factually incorrect and should be disregarded.

bitsandbytes 0.48.1 was released (Oct 2, 2025) and the published PyPI wheels are built to be compatible with CUDA 12.8 (cu128). The Dockerfile correctly pins `bitsandbytes==0.48.1`, which already provides the necessary cu128 support. No upgrade is required.

The original assertion that bitsandbytes 0.48.1 "lacks official CUDA 12.8 (cu128) wheel variants on PyPI" is inaccurate. The code is correct as-is.
Likely an incorrect or invalid review comment.
168-170: The review comment incorrectly characterizes this pattern as risky.

The current code is functionally safe. mamba-ssm 2.2.6.post3 does not hard-require causal-conv1d but exposes it as an optional extra, with a recommended constraint of `causal-conv1d>=1.4.0`. Since `causal-conv1d==1.5.3.post1` is installed on line 168 before mamba-ssm, the optional dependency is already satisfied. The `--no-deps` flag does skip dependency resolution for mamba-ssm, but this does not create a broken state because:
- causal-conv1d is pre-installed and compatible
- mamba-ssm will function correctly with or without causal-conv1d (it's optional, not required)
- No validation error occurs
The suggested refactoring (combining both into a single install command) is valid and arguably cleaner, but it addresses a stylistic preference, not a critical issue. The current code works as intended.
Likely an incorrect or invalid review comment.
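A self-documenting variant of that two-step install can be sketched directly in Dockerfile form; the version pins mirror the review thread, while the comments are illustrative rather than the author's:

```dockerfile
# Step 1: install the optional-but-recommended extension first, so it is
# already present when mamba-ssm builds its CUDA kernels.
RUN pip install --no-build-isolation --no-cache-dir causal-conv1d==1.5.3.post1

# Step 2: install mamba-ssm without dependency resolution; its optional
# extra (causal-conv1d) is satisfied by step 1, and --no-deps keeps pip
# from re-resolving dependencies and conflicting with training-hub pins.
RUN pip install --no-build-isolation --no-cache-dir mamba-ssm==2.2.6.post3 --no-deps
```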
images/universal/training/py312-cuda128-torch280/Dockerfile.debug (1)
319-319: Verify the actual build context and COPY path behavior before applying this change.

The review comment's reasoning contains a critical error. If the build context is the repository root (as claimed), then `COPY entrypoint-universal.sh` in the main Dockerfile would fail: the file is not at the repo root but at `images/universal/training/py312-cuda128-torch280/entrypoint-universal.sh`. The full path used in the debug Dockerfile would actually be the one that works from that context.

Additionally, Dockerfile.debug is not referenced in any Tekton pipeline configuration; only the main Dockerfile is built. Before assuming the debug variant should match the main variant, confirm:
- What is the actual build context used when buildah invokes the Dockerfile?
- Does the main Dockerfile's relative path actually work in practice?
- Is Dockerfile.debug intended for CI/CD builds or manual/development use only?
If both need to coexist, ensure the paths work correctly for their actual build contexts rather than assuming consistency without verification.
```dockerfile
# Fail-fast: ensure binary wheels exist for packages that are expensive to build
RUN mkdir -p /tmp/wheels && \
    python -m pip download --retries 5 --timeout 120 --only-binary=:all: --no-deps -d /tmp/wheels \
        numba==0.62.1 bitsandbytes==0.48.1 || \
```
Line 157 attempts to download bitsandbytes==0.48.1, which does not provide CUDA 12.8 (cu128) wheel variants on PyPI. The fail-fast check at lines 155–158 will fail, preventing the image build from proceeding. Additionally, line 183 pins the same version for the main install, so both must be updated.
Upgrade both occurrences to bitsandbytes==0.48.2 or later (confirmed cu128 support):
```diff
-RUN mkdir -p /tmp/wheels && \
-    python -m pip download --retries 5 --timeout 120 --only-binary=:all: --no-deps -d /tmp/wheels \
-        numba==0.62.1 bitsandbytes==0.48.1 || \
+RUN mkdir -p /tmp/wheels && \
+    python -m pip download --retries 5 --timeout 120 --only-binary=:all: --no-deps -d /tmp/wheels \
+        numba==0.62.1 bitsandbytes==0.48.2 || \
```

And at line 183:

```diff
 tensorboard==2.19.0 \
-bitsandbytes==0.48.1 \
+bitsandbytes==0.48.2 \
 liger-kernel==0.6.2 \
```

Also applies to: 157-157
🤖 Prompt for AI Agents
In images/universal/training/py312-cuda128-torch280/Dockerfile.debug around line
157 (and also update the matching pin at line 183), the Dockerfile currently
installs bitsandbytes==0.48.1 which lacks CUDA 12.8 (cu128) wheels and causes
the fail-fast check to fail; change both pins to bitsandbytes==0.48.2 (or later
confirmed to include cu128 wheels) so the fail-fast check and main install
succeed, keeping the version consistent in both locations.
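The fail-fast idiom this comment depends on is worth spelling out; the sketch below is a hedged stand-in where `fetch_wheels` substitutes for `python -m pip download --only-binary=:all: ...`, so it runs without network access. The point is that a failed binary-wheel fetch should surface a clear error instead of letting the build drift into a multi-hour source compile:

```shell
# Simulate "no binary wheel published for this pin" with a stub.
fetch_wheels() {
  return 1
}

if fetch_wheels; then
  result="wheels-ok"
else
  result="fail-fast"
  echo "ERROR: no binary wheel available; aborting before a slow source build" >&2
fi
echo "$result"   # → fail-fast
```

In the Dockerfile, the `|| ( echo ...; exit 1 )` tail of the `RUN` step plays the role of the `else` branch here.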
27d20b4 to c5ae569 (Compare)
Signed-off-by: Brian Gallagher <[email protected]>
c5ae569 to 2f92670 (Compare)
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
images/universal/training/py312-cuda128-torch280/Dockerfile (2)
69-72: Hadolint DL3044 claim verified, but the proposed diff needs correction.

Docker will not expand a variable within the same ENV statement where it is defined, which is exactly what Hadolint DL3044 flags. The review comment correctly identifies this issue: the original code attempts to reference `$PATH` and `$LD_LIBRARY_PATH` within the same ENV statement where they're being modified.

However, the proposed diff contains a problem: it adds `${CUDA_HOME}/bin` to the PATH, which wasn't in the original code. The original only references `/usr/local/cuda/bin:$PATH`, while the diff proposes `/usr/local/cuda/bin:${CUDA_HOME}/bin:${PATH}` (duplicating the CUDA path).

Corrected approach: split ENV into separate statements to fix DL3044, without changing the path values.

```diff
-ENV CUDA_HOME=/usr/local/cuda \
-    PATH=/usr/local/cuda/bin:$PATH \
-    LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH \
-    TORCH_CUDA_ARCH_LIST="8.0;8.6;8.9;9.0"
+ENV CUDA_HOME=/usr/local/cuda \
+    TORCH_CUDA_ARCH_LIST="8.0;8.6;8.9;9.0"
+ENV PATH=/usr/local/cuda/bin:${PATH}
+ENV LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH}
```

This corrects the DL3044 violation while preserving the original intended behavior: appending to the existing PATH and LD_LIBRARY_PATH across separate ENV statements, which properly enables variable expansion.
120-158: Upgrade bitsandbytes to 0.48.2 for CUDA 12.8 support.

bitsandbytes 0.48.1 does not publish cu128-tagged wheels on PyPI. Line 140 pins 0.48.1, which will fail at runtime in this CUDA 12.8 environment. bitsandbytes v0.48.2 provides prebuilt wheels supporting CUDA 12.8.

```diff
 tensorboard==2.19.0 \
-bitsandbytes==0.48.1 \
+bitsandbytes==0.48.2 \
 liger-kernel==0.6.2 \
```
♻️ Duplicate comments (3)
images/universal/training/py312-cuda128-torch280/Dockerfile.debug (3)
68-71: Add an `ldconfig` call to rebuild the dynamic linker cache after updating `/etc/ld.so.conf.d/nvidia.conf`.

Lines 68–71 append NVIDIA library paths to the ldconfig configuration file, but the dynamic linker cache is not rebuilt. Without running `ldconfig`, the CUDA libraries at those paths will not be discoverable at line 88 (ldconfig grep check) or at runtime.

Apply this diff to rebuild the cache:

```diff
     && echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf \
     && echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf \
+    && ldconfig \
     && dnf clean all \
```
155-158: Critical: bitsandbytes 0.48.1 fails the fail-fast check due to missing CUDA 12.8 wheels.

Line 157 pins `bitsandbytes==0.48.1`, which does not provide cu128-tagged wheel variants on PyPI. The fail-fast check will exit with status 1 during the build, preventing image completion. This issue was flagged in prior reviews but remains unresolved in the debug variant.

Upgrade to `bitsandbytes==0.48.2` or later:

```diff
-RUN mkdir -p /tmp/wheels && \
-    python -m pip download --retries 5 --timeout 120 --only-binary=:all: --no-deps -d /tmp/wheels \
-        numba==0.62.1 bitsandbytes==0.48.1 || \
+RUN mkdir -p /tmp/wheels && \
+    python -m pip download --retries 5 --timeout 120 --only-binary=:all: --no-deps -d /tmp/wheels \
+        numba==0.62.1 bitsandbytes==0.48.2 || \
```

Also update line 183 to match:

```diff
 tensorboard==2.19.0 \
-bitsandbytes==0.48.1 \
+bitsandbytes==0.48.2 \
 liger-kernel==0.6.2 \
```
75-78: Hadolint DL3044 violation: the ENV statement self-references the variables being defined.

Lines 76–77 reference `${CUDA_HOME}`, `${PATH}`, and `${LD_LIBRARY_PATH}` within the same `ENV` statement where they are being set, violating Docker best practice. Split into separate `ENV` statements to allow proper variable expansion.

Apply this diff:

```diff
-ARG CUDA_ARCH_LIST=9.0
-ENV CUDA_HOME=/usr/local/cuda \
-    PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH} \
-    LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:$CUDA_HOME/lib64:$CUDA_HOME/extras/CUPTI/lib64:$LD_LIBRARY_PATH \
-    TORCH_CUDA_ARCH_LIST="${CUDA_ARCH_LIST}"
+ARG CUDA_ARCH_LIST=9.0
+ENV CUDA_HOME=/usr/local/cuda \
+    TORCH_CUDA_ARCH_LIST="${CUDA_ARCH_LIST}"
+ENV PATH=/usr/local/nvidia/bin:${CUDA_HOME}/bin:${PATH}
+ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:${CUDA_HOME}/lib64:${CUDA_HOME}/extras/CUPTI/lib64:${LD_LIBRARY_PATH}
```
🧹 Nitpick comments (1)
images/universal/training/py312-cuda128-torch280/Dockerfile.debug (1)
232-316: Deterministic 2-step wheel build with comprehensive logging and timeout is a strong pattern.

The build approach (lines 234–316) combines ccache caching, parallel compilation, detailed logging, heartbeat monitoring, and a 22-minute timeout with a 120-second kill grace period. This significantly improves debuggability and prevents indefinite hangs. The inline bash script is verbose but appropriate for a debug variant.

As a future nice-to-have, consider extracting the long bash script into a separate shell file (e.g., `build_mamba_wheel.sh`) within the repository and COPY it into the image to improve readability and testability. This is not urgent for the current PR but would simplify future maintenance.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- `.tekton/universal-image-py312-cuda128-torch280-pull-request.yaml` (1 hunks)
- `.tekton/universal-image-py312-cuda128-torch280-push.yaml` (1 hunks)
- `images/universal/training/py312-cuda128-torch280/Dockerfile` (3 hunks)
- `images/universal/training/py312-cuda128-torch280/Dockerfile.debug` (1 hunks)
- `images/universal/training/py312-cuda128-torch280/README.md` (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- images/universal/training/py312-cuda128-torch280/README.md
- .tekton/universal-image-py312-cuda128-torch280-pull-request.yaml
🧰 Additional context used
🪛 Hadolint (2.14.0)
images/universal/training/py312-cuda128-torch280/Dockerfile.debug
[error] 75-75: Do not refer to an environment variable within the same ENV statement where it is defined.
(DL3044)
[error] 163-163: Multiple redirections compete for stdout. Use cat, tee, or pass filenames instead.
(SC2261)
images/universal/training/py312-cuda128-torch280/Dockerfile
[error] 120-120: Multiple redirections compete for stdout. Use cat, tee, or pass filenames instead.
(SC2261)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Red Hat Konflux / universal-image-py312-cuda128-torch280-on-pull-request
🔇 Additional comments (7)
.tekton/universal-image-py312-cuda128-torch280-push.yaml (1)
20-21: Reasonable timeout for long-running image builds.

The 9h pipeline timeout accommodates compilation steps (causal-conv1d, mamba-ssm) while the build-container task itself is capped at 8h. This prevents runaway builds while allowing sufficient time for the full pipeline to complete.

images/universal/training/py312-cuda128-torch280/Dockerfile (4)
22-23: License file handling addressed via TODO comment.

The LICENSE.md copy is now commented out with a TODO note. This resolves the build-time failure from previous reviews where the file was missing. Ensure the TODO is tracked and the actual license file is added before production release.
112-115: Package and build tooling updates look reasonable.

Line 112 switched to the `sdk@add-training-hub` branch and line 115 commented out the `pip install ninja` step (relying on the later install at line 162). These changes support the Training Hub 0.3.0 upgrade goal.
167-170: Deterministic 2-step build for mamba-ssm extensions.

The two-step installation (causal-conv1d first, then mamba-ssm with `--no-deps`) aims to avoid pip resolver conflicts. However, this pattern bypasses dependency resolution, which is risky if mamba-ssm has undeclared optional dependencies. Ensure via testing that mamba-ssm functions correctly with only causal-conv1d installed and no other implicit dependencies.

To improve defensibility, add a brief comment explaining why this 2-step pattern is necessary:

```diff
+# 2-step deterministic build: install causal-conv1d as a pre-requisite,
+# then mamba-ssm with --no-deps to avoid pip resolver conflicts with training-hub
 RUN pip install --no-build-isolation --no-cache-dir causal-conv1d==1.5.3.post1 && \
     pip install --no-build-isolation --no-cache-dir mamba-ssm==2.2.6.post3 --no-deps && \
     fix-permissions /opt/app-root -P
```
173-178: File exists; review comment concern is resolved.

The verification confirms that `./images/universal/training/py312-cuda128-torch280/entrypoint-universal.sh` exists in the repository. The COPY command on line 173 will succeed at build time. The entrypoint wrapper setup is sound and no issues remain.

images/universal/training/py312-cuda128-torch280/Dockerfile.debug (2)
318-324: Entrypoint wrapper setup aligns with the runtime variant and is consistent.

The COPY (line 319), ENTRYPOINT (line 323), and CMD (line 324) setup mirrors the runtime Dockerfile, preserving workbench behavior by default while enabling headless mode when a command is provided. Verify that `entrypoint-universal.sh` exists in the repository (same concern as the runtime variant).
22-22: No issues found. The LICENSE.md file exists in the repository, so the COPY command on line 22 will execute successfully. The code is correct as-is.
Description
bump universal image to support training hub 0.3.0
Summary by CodeRabbit
New Features
Chores
Documentation